Library Instalation & Imports¶


In [8]:
!pip install -U sentence-transformers gensim transformers tqdm -q --quiet
!pip install unidecode gensim -q --quiet
!pip install scikeras --quiet
!pip install keras --quiet
!pip install deep-translator --quiet
!pip uninstall tensorflow keras tensorflow-addons -y --quiet
!pip install tensorflow==2.15.0 --quiet
!pip install tensorflow-addons==0.22.0 --quiet
!wget https://raw.githubusercontent.com/jasonwei20/eda_nlp/master/code/eda.py --quiet
!pip install keras-cv-attention-models  --quiet
!pip install bertopic
!pip install nlpaug

import os
#os.kill(os.getpid(), 9)
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.10.0 which is incompatible.
tf-keras 2.18.0 requires tensorflow<2.19,>=2.18, but you have tensorflow 2.15.0 which is incompatible.
tensorflow-decision-forests 1.11.0 requires tensorflow==2.18.0, but you have tensorflow 2.15.0 which is incompatible.
tensorflow-text 2.18.1 requires tensorflow<2.19,>=2.18.0, but you have tensorflow 2.15.0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scikeras 0.13.0 requires keras>=3.2.0, but you have keras 2.15.0 which is incompatible.
tf-keras 2.18.0 requires tensorflow<2.19,>=2.18, but you have tensorflow 2.15.0 which is incompatible.
tensorflow-decision-forests 1.11.0 requires tensorflow==2.18.0, but you have tensorflow 2.15.0 which is incompatible.
tensorflow-text 2.18.1 requires tensorflow<2.19,>=2.18.0, but you have tensorflow 2.15.0 which is incompatible.
Requirement already satisfied: bertopic in /usr/local/lib/python3.11/dist-packages (0.17.0)
Requirement already satisfied: hdbscan>=0.8.29 in /usr/local/lib/python3.11/dist-packages (from bertopic) (0.8.40)
Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.11/dist-packages (from bertopic) (1.26.4)
Requirement already satisfied: pandas>=1.1.5 in /usr/local/lib/python3.11/dist-packages (from bertopic) (2.2.2)
Requirement already satisfied: plotly>=4.7.0 in /usr/local/lib/python3.11/dist-packages (from bertopic) (5.24.1)
Requirement already satisfied: scikit-learn>=1.0 in /usr/local/lib/python3.11/dist-packages (from bertopic) (1.6.1)
Requirement already satisfied: sentence-transformers>=0.4.1 in /usr/local/lib/python3.11/dist-packages (from bertopic) (4.1.0)
Requirement already satisfied: tqdm>=4.41.1 in /usr/local/lib/python3.11/dist-packages (from bertopic) (4.67.1)
Requirement already satisfied: umap-learn>=0.5.0 in /usr/local/lib/python3.11/dist-packages (from bertopic) (0.5.7)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.11/dist-packages (from hdbscan>=0.8.29->bertopic) (1.13.1)
Requirement already satisfied: joblib>=1.0 in /usr/local/lib/python3.11/dist-packages (from hdbscan>=0.8.29->bertopic) (1.5.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.1.5->bertopic) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.1.5->bertopic) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.1.5->bertopic) (2025.2)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.11/dist-packages (from plotly>=4.7.0->bertopic) (9.1.2)
Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from plotly>=4.7.0->bertopic) (24.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn>=1.0->bertopic) (3.6.0)
Requirement already satisfied: transformers<5.0.0,>=4.41.0 in /usr/local/lib/python3.11/dist-packages (from sentence-transformers>=0.4.1->bertopic) (4.52.4)
Requirement already satisfied: torch>=1.11.0 in /usr/local/lib/python3.11/dist-packages (from sentence-transformers>=0.4.1->bertopic) (2.6.0+cu124)
Requirement already satisfied: huggingface-hub>=0.20.0 in /usr/local/lib/python3.11/dist-packages (from sentence-transformers>=0.4.1->bertopic) (0.33.0)
Requirement already satisfied: Pillow in /usr/local/lib/python3.11/dist-packages (from sentence-transformers>=0.4.1->bertopic) (11.2.1)
Requirement already satisfied: typing_extensions>=4.5.0 in /usr/local/lib/python3.11/dist-packages (from sentence-transformers>=0.4.1->bertopic) (4.14.0)
Requirement already satisfied: numba>=0.51.2 in /usr/local/lib/python3.11/dist-packages (from umap-learn>=0.5.0->bertopic) (0.60.0)
Requirement already satisfied: pynndescent>=0.5 in /usr/local/lib/python3.11/dist-packages (from umap-learn>=0.5.0->bertopic) (0.5.13)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (3.18.0)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (2025.3.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (6.0.2)
Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (2.32.3)
Requirement already satisfied: hf-xet<2.0.0,>=1.1.2 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (1.1.3)
Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.11/dist-packages (from numba>=0.51.2->umap-learn>=0.5.0->bertopic) (0.43.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas>=1.1.5->bertopic) (1.17.0)
Requirement already satisfied: networkx in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (3.5)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (3.1.6)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (12.4.127)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (12.4.127)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (12.4.127)
Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (9.1.0.70)
Requirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (12.4.5.8)
Requirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (11.2.1.3)
Requirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (10.3.5.147)
Requirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (11.6.1.9)
Requirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (12.3.1.170)
Requirement already satisfied: nvidia-cusparselt-cu12==0.6.2 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (0.6.2)
Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (2.21.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (12.4.127)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (12.4.127)
Requirement already satisfied: triton==3.2.0 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (3.2.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.11/dist-packages (from torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from sympy==1.13.1->torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (1.3.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.11/dist-packages (from transformers<5.0.0,>=4.41.0->sentence-transformers>=0.4.1->bertopic) (2024.11.6)
Requirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.11/dist-packages (from transformers<5.0.0,>=4.41.0->sentence-transformers>=0.4.1->bertopic) (0.21.1)
Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.11/dist-packages (from transformers<5.0.0,>=4.41.0->sentence-transformers>=0.4.1->bertopic) (0.5.3)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->torch>=1.11.0->sentence-transformers>=0.4.1->bertopic) (3.0.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (3.4.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (2.4.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->huggingface-hub>=0.20.0->sentence-transformers>=0.4.1->bertopic) (2025.6.15)
Requirement already satisfied: nlpaug in /usr/local/lib/python3.11/dist-packages (1.1.11)
Requirement already satisfied: numpy>=1.16.2 in /usr/local/lib/python3.11/dist-packages (from nlpaug) (1.26.4)
Requirement already satisfied: pandas>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from nlpaug) (2.2.2)
Requirement already satisfied: requests>=2.22.0 in /usr/local/lib/python3.11/dist-packages (from nlpaug) (2.32.3)
Requirement already satisfied: gdown>=4.0.0 in /usr/local/lib/python3.11/dist-packages (from nlpaug) (5.2.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from gdown>=4.0.0->nlpaug) (4.13.4)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from gdown>=4.0.0->nlpaug) (3.18.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from gdown>=4.0.0->nlpaug) (4.67.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2.0->nlpaug) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2.0->nlpaug) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2.0->nlpaug) (2025.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests>=2.22.0->nlpaug) (3.4.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests>=2.22.0->nlpaug) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests>=2.22.0->nlpaug) (2.4.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests>=2.22.0->nlpaug) (2025.6.15)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas>=1.2.0->nlpaug) (1.17.0)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown>=4.0.0->nlpaug) (2.7)
Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown>=4.0.0->nlpaug) (4.14.0)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown>=4.0.0->nlpaug) (1.7.1)
In [9]:
#importing required libraries

import numpy as np
import time
import json
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import pandas as pd
from google.colab import drive
import pickle
from sklearn.utils.class_weight import compute_class_weight




warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)


# NLP
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk import tokenize, stem
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob
from tqdm import tqdm
from wordcloud import WordCloud, STOPWORDS
tqdm.pandas()
import re
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
import nlpaug.augmenter.word as naw
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow.keras.backend as K


# ML and Vectorization
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.decomposition import PCA, TruncatedSVD
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import keyedvectors
from sklearn.feature_extraction.text import TfidfVectorizer


#ML

from sklearn.base import clone
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, StratifiedKFold


#Transformers
from bertopic import BERTopic
import spacy
from transformers import pipeline
from deep_translator import GoogleTranslator
from collections import Counter
from transformers import BertTokenizer, TFBertModel
from sentence_transformers import SentenceTransformer



#Neural-Network

from tensorflow.keras.models import Sequential,Model
from tensorflow.keras.layers import Dense, Dropout, Embedding, GlobalAveragePooling1D, BatchNormalization,SimpleRNN,Bidirectional,Input,LSTM
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import tensorflow_addons as tfa
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!

Data Exploration¶


In [10]:
#mounting the drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [11]:
# read excel data into dataframe
df_ish = pd.read_excel('/content/drive/MyDrive/Capstone/ISH_dataset.xlsx')
In [12]:
#copying the data in a different dataframe for feature analysis and feature extraction
df_ish_copy= df_ish.copy()
In [13]:
# Load the first 10 rows from data frame
df_ish.head(10)
Out[13]:
Unnamed: 0 Data Countries Local Industry Sector Accident Level Potential Accident Level Genre Employee or Third Party Critical Risk Description
0 0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f...
1 1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum...
2 2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170...
3 3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C...
4 4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t...
5 5 2016-01-12 Country_02 Local_05 Metals I III Male Third Party (Remote) Pressurized Systems During the unloading operation of the ustulado...
6 6 2016-01-16 Country_02 Local_05 Metals I III Male Employee Fall prevention (same level) The collaborator reports that he was on street...
7 7 2016-01-17 Country_01 Local_04 Mining I III Male Third Party Pressed At approximately 04:50 p.m., when the mechanic...
8 8 2016-01-19 Country_02 Local_02 Mining I IV Male Third Party (Remote) Others Employee was sitting in the resting area at le...
9 9 2016-01-26 Country_01 Local_06 Metals I II Male Third Party Chemical substances At the moment the forklift operator went to ma...
In [14]:
#description of columns

df_ish.describe(include='all').T.round(2)
Out[14]:
count unique top freq mean min 25% 50% 75% max std
Unnamed: 0 425.0 NaN NaN NaN 224.084706 0.0 118.0 226.0 332.0 438.0 125.526786
Data 425 NaN NaN NaN 2016-09-20 16:46:18.352941312 2016-01-01 00:00:00 2016-05-01 00:00:00 2016-09-13 00:00:00 2017-02-08 00:00:00 2017-07-09 00:00:00 NaN
Countries 425 3 Country_01 251 NaN NaN NaN NaN NaN NaN NaN
Local 425 12 Local_03 90 NaN NaN NaN NaN NaN NaN NaN
Industry Sector 425 3 Mining 241 NaN NaN NaN NaN NaN NaN NaN
Accident Level 425 5 I 316 NaN NaN NaN NaN NaN NaN NaN
Potential Accident Level 425 6 IV 143 NaN NaN NaN NaN NaN NaN NaN
Genre 425 2 Male 403 NaN NaN NaN NaN NaN NaN NaN
Employee or Third Party 425 3 Third Party 189 NaN NaN NaN NaN NaN NaN NaN
Critical Risk 425 33 Others 232 NaN NaN NaN NaN NaN NaN NaN
Description 425 411 During the activity of chuteo of ore in hopper... 3 NaN NaN NaN NaN NaN NaN NaN

Data processing & Feature extraction¶


Column Renaming¶

In [15]:
# Renaming 'Data', 'Countries', 'Genre' , 'Employee or Third Party' columns in Data frame

df_ish.rename(columns={"Data": "Date", "Countries": "Country", "Genre": "Gender" , "Employee or Third Party":"Employee Type"},inplace=True)
In [16]:
# Dataset info
df_ish.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Unnamed: 0                425 non-null    int64         
 1   Date                      425 non-null    datetime64[ns]
 2   Country                   425 non-null    object        
 3   Local                     425 non-null    object        
 4   Industry Sector           425 non-null    object        
 5   Accident Level            425 non-null    object        
 6   Potential Accident Level  425 non-null    object        
 7   Gender                    425 non-null    object        
 8   Employee Type             425 non-null    object        
 9   Critical Risk             425 non-null    object        
 10  Description               425 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 36.7+ KB

Null & Duplicate value detection & removal¶

In [17]:
# Dropping the index column
df_ish.drop('Unnamed: 0', axis=1, inplace=True)
In [18]:
# Check if missing values are in Dataframe , If missing values are present then check if those need to be fixed

df_ish.isnull().sum()
Out[18]:
0
Date 0
Country 0
Local 0
Industry Sector 0
Accident Level 0
Potential Accident Level 0
Gender 0
Employee Type 0
Critical Risk 0
Description 0

In [19]:
#identify Duplicates
df_ish.duplicated().sum()
Out[19]:
7
In [20]:
# Dropping duplicates
df_ish.drop_duplicates(inplace=True, ignore_index=True)
In [21]:
# Dataset info
df_ish.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Date                      418 non-null    datetime64[ns]
 1   Country                   418 non-null    object        
 2   Local                     418 non-null    object        
 3   Industry Sector           418 non-null    object        
 4   Accident Level            418 non-null    object        
 5   Potential Accident Level  418 non-null    object        
 6   Gender                    418 non-null    object        
 7   Employee Type             418 non-null    object        
 8   Critical Risk             418 non-null    object        
 9   Description               418 non-null    object        
dtypes: datetime64[ns](1), object(9)
memory usage: 32.8+ KB
In [22]:
# Describe , Transpose the information  to do the statistical summary identification
df_ish.describe().T.round(2)
Out[22]:
count mean min 25% 50% 75% max
Date 418 2016-09-18 20:50:31.578947328 2016-01-01 00:00:00 2016-04-30 06:00:00 2016-09-06 00:00:00 2017-02-06 12:00:00 2017-07-09 00:00:00

Data Engineering & Feature extraction¶

In [23]:
# Convert 'Date' column to datetime objects
df_ish['Date'] = pd.to_datetime(df_ish['Date'])

# Splitting date, month and year
df_ish['Year'] = df_ish['Date'].apply(lambda x: x.year)
df_ish['Month'] = df_ish['Date'].apply(lambda x: x.month)
df_ish['Day'] = df_ish['Date'].apply(lambda x: x.day)
In [24]:
df_ish.head(10)
Out[24]:
Date Country Local Industry Sector Accident Level Potential Accident Level Gender Employee Type Critical Risk Description Year Month Day
0 2016-01-01 Country_01 Local_01 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... 2016 1 1
1 2016-01-02 Country_02 Local_02 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... 2016 1 2
2 2016-01-06 Country_01 Local_03 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... 2016 1 6
3 2016-01-08 Country_01 Local_04 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C... 2016 1 8
4 2016-01-10 Country_01 Local_04 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t... 2016 1 10
5 2016-01-12 Country_02 Local_05 Metals I III Male Third Party (Remote) Pressurized Systems During the unloading operation of the ustulado... 2016 1 12
6 2016-01-16 Country_02 Local_05 Metals I III Male Employee Fall prevention (same level) The collaborator reports that he was on street... 2016 1 16
7 2016-01-17 Country_01 Local_04 Mining I III Male Third Party Pressed At approximately 04:50 p.m., when the mechanic... 2016 1 17
8 2016-01-19 Country_02 Local_02 Mining I IV Male Third Party (Remote) Others Employee was sitting in the resting area at le... 2016 1 19
9 2016-01-26 Country_01 Local_06 Metals I II Male Third Party Chemical substances At the moment the forklift operator went to ma... 2016 1 26
In [25]:
#function to seperate cat and num columns

def cat_num_variable(data):
  cat_var=[]
  num_var=[]
  for i in data.columns:
    if data[i].dtype=='object':
      cat_var.append(i)
    else:
      num_var.append(i)
  print(f'Catagorical variable : {cat_var}')
  print(f'Numerical variable : {num_var}')
  return cat_var,num_var

cat_variable=[]
num_variable=[]
cat_variable,num_variable=cat_num_variable(df_ish)




def aggregate_others(df, column, max_categories=10):
    counts = df[column].value_counts()
    top_categories = counts.nlargest(max_categories).index
    df[column] = df[column].apply(lambda x: x if x in top_categories else 'Others')
    return df
Catagorical variable : ['Country', 'Local', 'Industry Sector', 'Accident Level', 'Potential Accident Level', 'Gender', 'Employee Type', 'Critical Risk', 'Description']
Numerical variable : ['Date', 'Year', 'Month', 'Day']

Exploratory Data Analysis (EDA)¶


Univariated Analysis¶

In [26]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def univariate_graphPrep(graphtype, variableName, df, **kwargs):
    low_card = []
    high_card = []

    # Split columns
    for col in variableName:
        if df[col].nunique() <= 7:
            low_card.append(col)
        else:
            high_card.append(col)

    print(f"Low-cardinality columns: {len(low_card)} | High-cardinality columns: {len(high_card)}")

    # Plot low-cardinality columns
    if low_card:
        size = len(low_card)
        n_cols = 2
        n_rows = int(np.ceil(size / n_cols))
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(10 * n_cols, 6 * n_rows))
        axes = axes.flatten() if size > 1 else [axes]

        for i, col in enumerate(low_card):
            order = df[col].value_counts().index
            if graphtype == "histogram":
                sns.histplot(data=df, ax=axes[i], x=col, bins='auto', stat='count', kde=True, color='skyblue')
            elif graphtype == "boxplot":
                sns.boxplot(data=df, ax=axes[i], x=col, **kwargs)
            elif graphtype == "countplot":
                sns.countplot(data=df, ax=axes[i], y=col, palette='Set3',order=order, **kwargs)
            elif graphtype == "piechart":
                value_counts = df[col].value_counts()
                axes[i].pie(value_counts, labels=value_counts.index, autopct='%1.1f%%',
                            startangle=140, textprops={'fontsize': 10},
                            colors=sns.color_palette('Set3'))
                axes[i].axis('equal')
            axes[i].set_title(f'{graphtype.capitalize()} of {col}')

        for j in range(len(low_card), len(axes)):
            fig.delaxes(axes[j])

        plt.tight_layout()
        plt.show()

    # Plot high-cardinality columns
    if high_card:
        size = len(high_card)
        n_cols = 1
        n_rows = size
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4 * n_rows))
        axes = axes.flatten() if size > 1 else [axes]

        for i, col in enumerate(high_card):
            order = df[col].value_counts().index
            if graphtype == "histogram":
                sns.histplot(data=df, ax=axes[i], x=col, bins='auto', stat='count', kde=True, color='skyblue')
            elif graphtype == "boxplot":
                sns.boxplot(data=df, ax=axes[i], x=col, **kwargs)
            elif graphtype == "countplot":

                sns.countplot(data=df, ax=axes[i], x=col, palette='Set3',order=order,  **kwargs)
            elif graphtype == "piechart":
                value_counts = df[col].value_counts().nlargest(10)  # show top 10 to avoid clutter
                axes[i].pie(value_counts, labels=value_counts.index, autopct='%1.1f%%',
                            startangle=140, textprops={'fontsize': 10},
                            colors=sns.color_palette('Set3'))
                axes[i].axis('equal')
            axes[i].set_title(f'{graphtype.capitalize()} of {col} (high-cardinality)')
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.show()

Countplot Analysis¶

In [27]:
#univariate analysis
univariate_graphPrep('countplot',cat_variable[:-1],df_ish)
Low-cardinality columns: 6 | High-cardinality columns: 2
No description has been provided for this image
No description has been provided for this image
In [28]:
# Extract day names
df_ish['Weekday'] = df_ish['Date'].dt.day_name()


# Define seasons based on month
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

# Create season column
df_ish['Season'] = df_ish['Date'].dt.month.map(get_season)


univariate_graphPrep('piechart',['Month','Season', 'Weekday'],df_ish )
Low-cardinality columns: 2 | High-cardinality columns: 1
No description has been provided for this image
No description has been provided for this image
In [29]:
for year in df_ish['Year'].unique():
  print(f'Year : {year}')
  univariate_graphPrep('countplot',['Month', 'Accident Level', 'Country', 'Industry Sector'],pd.DataFrame(df_ish[df_ish['Year']==year]))
Year : 2016
Low-cardinality columns: 3 | High-cardinality columns: 1
No description has been provided for this image
No description has been provided for this image
Year : 2017
Low-cardinality columns: 4 | High-cardinality columns: 0
No description has been provided for this image

Bivarated Analysis¶

In [30]:
def bivariate_graphPrep(graphtype, variablePairs, df, n_cols=2, figsize=None, **kwargs):
    high_card=[]
    low_card=[]
    for i, ((var1, var2), hue) in enumerate(variablePairs):
        if df[var1].nunique() >= 7:
            high_card.append(((var1, var2), hue))
            # continue
        else:
            low_card.append(((var1, var2), hue))

    if low_card:
      n_plots = len(variablePairs)
      n_rows = int(np.ceil(n_plots / n_cols))

      if figsize is None:
        figsize = (20 * n_cols, 15 * n_rows)

      fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
    # If only one plot, axes is not an array, so convert to array for uniformity
      if n_plots == 1:
        axes = np.array([axes])
      else:
        axes = axes.flatten()

      for i, ((var1, var2), hue) in enumerate(low_card):
        ax = axes[i]
        plot_kwargs = kwargs.copy()
        if hue:
            plot_kwargs['hue'] = hue
        try:
            if graphtype == "histplot":
                # histplot with x and y; kde only supports univariate, so disabling kde
                sns.histplot(data=df, x=var1, y=var2, bins='auto', stat='count',
                             palette='Spectral', ax=ax, **plot_kwargs)
                ax.set_title(f'Histogram of {var1} vs {var2}' + (f' (hue={hue})' if hue else ''))
                ax.set_xlabel(var1, fontsize=14)
                ax.set_ylabel(var2, fontsize=14)

            elif graphtype == "violinplot":
                sns.violinplot(data=df, x=var1, y=var2, ax=ax, **plot_kwargs)
                ax.set_title(f'Violinplot of {var1} vs {var2}' + (f' (hue={hue})' if hue else ''), fontsize=16)
                ax.set_xlabel(var1, fontsize=14)
                ax.set_ylabel(var2, fontsize=14)

            elif graphtype == "countplot":
                order =df[var1].value_counts().index

                # countplot only supports x and optionally hue
                if var2 is None or var2 == '':
                    sns.countplot(data=df, x=var1 ,ax=ax,order=order, **plot_kwargs)
                    ax.set_ylabel("Count", fontsize=15)
                else:
                    ax.text(0.5, 0.5, "Countplot supports only x (and hue).", ha='center', fontsize=12)
                    ax.set_axis_off()
                ax.set_title(f'Countplot of {var1}' + (f' (hue={hue})' if hue else ''), fontsize=16)
                ax.set_xlabel(var1, fontsize=15)

            elif graphtype == "barplot":
                # barplot requires both x and y
                if var2 and var2 != '':
                    sns.barplot(data=df, x=var1, y=var2, palette='Spectral', ax=ax, ci=None, **plot_kwargs)
                    ax.set_ylabel(var2, fontsize=14)
                else:
                    ax.text(0.5, 0.5, "Barplot needs both x and y.", ha='center', fontsize=12)
                    ax.set_axis_off()
                ax.set_title(f'Barplot of {var1} vs {var2}' + (f' (hue={hue})' if hue else ''), fontsize=16)
                ax.set_xlabel(var1, fontsize=14)

            elif graphtype == "scatterplot":
                if var1 and var2 and var1 != '' and var2 != '':
                    sns.scatterplot(data=df, x=var1, y=var2, ax=ax, **plot_kwargs)
                    ax.set_title(f'Scatterplot of {var1} vs {var2}' + (f' (hue={hue})' if hue else ''), fontsize=16)
                    ax.set_xlabel(var1, fontsize=14)
                    ax.set_ylabel(var2, fontsize=14)
                else:
                    ax.text(0.5, 0.5, "Scatterplot needs both x and y.", ha='center', fontsize=12)
                    ax.set_axis_off()

            else:
                ax.text(0.5, 0.5, f"Unsupported graphtype: {graphtype}",
                        ha='center', va='center', fontsize=12)
                ax.set_axis_off()

            ax.tick_params(axis='x', rotation=90, labelsize=17, pad=5)
            ax.tick_params(axis='y', labelsize=17, pad=5)

            # Manage legend
            if hue:
                if ax.get_legend() is None:
                    ax.legend(title=hue, fontsize=10, title_fontsize=12)
                else:
                    ax.legend(title=hue, fontsize=10, title_fontsize=12)
            else:
                if ax.get_legend() is not None:
                    ax.get_legend().remove()

        except Exception as e:
            ax.text(0.5, 0.5, f"Error plotting:\n{e}",
                    ha='center', va='center', fontsize=10, color='red')
            ax.set_axis_off()

    # Remove unused axes if any
      for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

      plt.tight_layout(pad=2.0)
      plt.show()

    if high_card:
        size = len(high_card)
        n_cols = 1
        n_rows = size
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(13, 8 * n_rows))
        axes = axes.flatten() if size > 1 else [axes]
        for idx,((var1, var2), hue) in enumerate(high_card):
          ax = axes[idx]
          plot_kwargs = kwargs.copy()
          if hue:
           plot_kwargs['hue'] = hue
           try:
             if graphtype == "histplot":
                # histplot with x and y; kde only supports univariate, so disabling kde
                sns.histplot(data=df, x=var1, y=var2, bins='auto', stat='count',
                             palette='Spectral', ax=ax, **plot_kwargs)
                ax.set_title(f'Histogram of {var1} vs {var2}' + (f' (hue={hue})' if hue else ''))
                ax.set_xlabel(var1, fontsize=14)
                ax.set_ylabel(var2, fontsize=14)

             elif graphtype == "violinplot":
                sns.violinplot(data=df, x=var1, y=var2, ax=ax, **plot_kwargs)
                ax.set_title(f'Violinplot of {var1} vs {var2}' + (f' (hue={hue})' if hue else ''), fontsize=16)
                ax.set_xlabel(var1, fontsize=14)
                ax.set_ylabel(var2, fontsize=14)

             elif graphtype == "countplot":
                order=df[var1].value_counts().index
                # countplot only supports x and optionally hue
                if var2 is None or var2 == '':
                    sns.countplot(data=df, x=var1, ax=ax,order=order,  **plot_kwargs)
                    ax.set_ylabel("Count", fontsize=12)
                else:
                    ax.text(0.5, 0.5, "Countplot supports only x (and hue).", ha='center', fontsize=12)
                    ax.set_axis_off()
                ax.set_title(f'Countplot of {var1}' + (f' (hue={hue})' if hue else ''), fontsize=16)
                ax.set_xlabel(var1, fontsize=12)

             elif graphtype == "barplot":
                # barplot requires both x and y
                if var2 and var2 != '':
                    sns.barplot(data=df, x=var1, y=var2, palette='Spectral', ax=ax, ci=None, **plot_kwargs)
                    ax.set_ylabel(var2, fontsize=14)
                else:
                    ax.text(0.5, 0.5, "Barplot needs both x and y.", ha='center', fontsize=12)
                    ax.set_axis_off()
                ax.set_title(f'Barplot of {var1} vs {var2}' + (f' (hue={hue})' if hue else ''), fontsize=16)
                ax.set_xlabel(var1, fontsize=14)

             elif graphtype == "scatterplot":
                if var1 and var2 and var1 != '' and var2 != '':
                    sns.swarmplot(data=df, x=var1, y=var2, ax=ax,**plot_kwargs)
                    ax.set_title(f'Scatterplot of {var1} vs {var2}' + (f' (hue={hue})' if hue else ''), fontsize=16)
                    ax.set_xlabel(var1.replace('_Code',''), fontsize=14)
                    ax.set_ylabel(var2.replace('_Code',''), fontsize=14)
                    ax.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.7)
                else:
                    ax.text(0.5, 0.5, "Scatterplot needs both x and y.", ha='center', fontsize=12)
                    ax.set_axis_off()

             else:
                ax.text(0.5, 0.5, f"Unsupported graphtype: {graphtype}",
                        ha='center', va='center', fontsize=12)
                ax.set_axis_off()

             ax.tick_params(axis='x', rotation=90, labelsize=8, pad=5)
             ax.tick_params(axis='y', labelsize=8, pad=5)

            # Manage legend
             if hue:
                if ax.get_legend() is None:
                    ax.legend(title=hue, fontsize=10, title_fontsize=12)
                else:
                    ax.legend(title=hue, fontsize=10, title_fontsize=12)
             else:
                if ax.get_legend() is not None:
                    ax.get_legend().remove()

           except Exception as e:
            ax.text(0.5, 0.5, f"Error plotting:\n{e}",
                    ha='center', va='center', fontsize=10, color='red')
            ax.set_axis_off()

    # Remove unused axes if any
        for j in range(idx + 1, len(axes)):
          fig.delaxes(axes[j])

    plt.tight_layout(pad=2.0)
    plt.show()





variablePairs = [
    (('Industry Sector', 'Month'),'Country'),
    (('Accident Level', 'Month'),'Country'),
    (('Accident Level', 'Country'),'Gender'),
    (('Accident Level', 'Potential Accident Level'),'Country'),
    (('Gender', 'Potential Accident Level'),'Country'),
    (('Country', 'Accident Level'),'Industry Sector'),
    (('Country', 'Potential Accident Level'),'Industry Sector'),
    (('Critical Risk', 'Country'),'Industry Sector'),
    (('Critical Risk', 'Country'),'Employee Type'),
    (('Country', 'Employee Type'),'Industry Sector'),
    (('Industry Sector', 'Accident Level'),'Season' ),
    (('Industry Sector', 'Year'),'Season' )
]

Bivariated VilonPlot¶

In [31]:
#bivariate

bivariate_graphPrep('violinplot',variablePairs ,df_ish)
No description has been provided for this image
No description has been provided for this image

Countplot of relation between independent variable¶

In [32]:
CountplotVariablePairs = [
    (('Month','' ), 'Accident Level'),
    (('Month','' ), 'Industry Sector'),
    (('Accident Level', ''), 'Gender'),
    (('Accident Level',''), 'Potential Accident Level'),
    (( 'Potential Accident Level',''),'Gender'),
    (( 'Accident Level',''), 'Country'),
    (('Critical Risk',''), 'Country'),
    (('Critical Risk',''), 'Industry Sector'),
    (('Accident Level',''), 'Industry Sector'),
    (('Industry Sector',''), 'Year'),
    (('Accident Level', ''), 'Season'),
    (('Industry Sector',''), 'Season'),
]


bivariate_graphPrep('countplot',CountplotVariablePairs ,df_ish, n_cols=3)
No description has been provided for this image
No description has been provided for this image
In [33]:
df_ish['Year'] = df_ish['Date'].dt.year

# Count of critical risks per year
risk_year_counts = df_ish.groupby(['Year', 'Critical Risk']).size().unstack(fill_value=0)

# Plot
plt.figure(figsize=(16, 10))
risk_year_counts.plot(kind='bar', stacked=True, colormap='viridis', figsize=(16, 8))

plt.title('Distribution of Critical Risks by Year', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Accidents', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Critical Risk', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
<Figure size 1600x1000 with 0 Axes>
No description has been provided for this image

Scatterplot¶

In [34]:
# Plot settings
bivariate_graphPrep('scatterplot',[( ('Critical Risk', 'Accident Level'),'Industry Sector')],df_ish,n_cols=1);
No description has been provided for this image
In [35]:
# Plot settings
bivariate_graphPrep('scatterplot', [(('Critical Risk','Season'), 'Industry Sector') ],df_ish,1);
No description has been provided for this image

Feature extraction after analysis and catagoriation¶

In [36]:
#Data engineering

df_ish.dtypes

#creating month_year column
df_ish['Month_year'] = df_ish['Month'].astype(str) + '-' + df_ish['Year'].astype(str)



#creating locality code column
df_ish['Local_code'] = df_ish['Country'].astype(str) + '-' + df_ish['Local'].astype(str)


#dropping year, month, day, country, local columns
df_ish.drop(['Year','Date', 'Month', 'Day', 'Country','Local'], inplace=True, axis=1)
In [37]:
#again defining cat and num cols

df_ish_cat_variable,df_ish_num_variable= cat_num_variable(df_ish)

df_ish_cat_variable.remove('Description')
print(f'Catagorical variables after Removing Description :{df_ish_cat_variable}')
Catagorical variable : ['Industry Sector', 'Accident Level', 'Potential Accident Level', 'Gender', 'Employee Type', 'Critical Risk', 'Description', 'Weekday', 'Season', 'Month_year', 'Local_code']
Numerical variable : []
Catagorical variables after Removing Description :['Industry Sector', 'Accident Level', 'Potential Accident Level', 'Gender', 'Employee Type', 'Critical Risk', 'Weekday', 'Season', 'Month_year', 'Local_code']

Data Preparation¶


Target variable Label encoding¶

In [38]:
df_ish['Accident Level'].value_counts()
Out[38]:
count
Accident Level
I 309
II 40
III 31
IV 30
V 8

In [39]:
encoded_df_ish = df_ish.copy()
In [40]:
#column mapping for accident level
category_mapping = {
    'I': 0,
    'II': 0,
    'III': 1,
    'IV': 1,
    'V': 2
}

#column mapping applied to accident level
encoded_df_ish['Accident_Level_groupped'] = encoded_df_ish['Accident Level'].map(category_mapping)
In [41]:
encoded_df_ish.head(5)
Out[41]:
Industry Sector Accident Level Potential Accident Level Gender Employee Type Critical Risk Description Weekday Season Month_year Local_code Accident_Level_groupped
0 Mining I IV Male Third Party Pressed While removing the drill rod of the Jumbo 08 f... Friday Winter 1-2016 Country_01-Local_01 0
1 Mining I IV Male Employee Pressurized Systems During the activation of a sodium sulphide pum... Saturday Winter 1-2016 Country_02-Local_02 0
2 Mining I III Male Third Party (Remote) Manual Tools In the sub-station MILPO located at level +170... Wednesday Winter 1-2016 Country_01-Local_03 0
3 Mining I I Male Third Party Others Being 9:45 am. approximately in the Nv. 1880 C... Friday Winter 1-2016 Country_01-Local_04 0
4 Mining IV IV Male Third Party Others Approximately at 11:45 a.m. in circumstances t... Sunday Winter 1-2016 Country_01-Local_04 1

Text preprocessing¶

In [42]:
#defining stopwords

STOPWORDS.update(["cm", "kg", "mr", "wa" ,"nv", "ore", "da", "pm", "am", "cx","cm", "my","cm x","one", "the", "and", "that" ,'.', ',','employee', 'on', 'wet', 'floor', 'causing', 'injury', 'to', 'left',
 ])
print(STOPWORDS)
{"you'll", 'here', 'com', 'yours', "couldn't", 'because', "it's", 'wa', "he'd", "doesn't", 'about', 'and', "we'll", "that's", 'one', 'when', 'the', 'why', 'below', "when's", 'both', "who's", 'get', 'causing', 'there', 'injury', 'shall', 'been', 'not', 'having', 'do', 'own', 'no', 'too', 'cm', "they'd", 'after', 'cannot', 'down', "he's", 'da', "they've", "don't", 'few', "how's", "you're", 'under', 'floor', "they're", "mustn't", 'otherwise', 'he', "can't", 'into', 'mr', 'just', 'with', "here's", 'their', 'those', 'hence', 'would', 'or', 'does', 'we', 'again', 'over', 'itself', 'but', 'since', "we'd", "there's", 'it', 'my', 'an', 'how', 'yourself', 'hers', "where's", "wasn't", 'his', 'have', ',', 'until', 'be', "weren't", 'by', 'cx', 'if', 'during', "she'll", 'yourselves', 'that', 'who', 'in', 'only', "you've", "wouldn't", 'while', 'then', "let's", 'these', 'himself', 'to', 'k', 'before', "didn't", "what's", 'each', 'such', 'our', "hasn't", 'they', 'of', "i'll", 'themselves', "shan't", 'she', "we've", 'further', "i'd", 'very', "aren't", 'wet', 'can', 'what', 'kg', 'www', 'like', 'nv', "she's", 'most', 'ours', "haven't", 'which', 'a', 'am', 'had', 'at', 'through', 'out', 'same', 'were', "shouldn't", 'all', 'doing', "they'll", 'me', 'are', 'against', 'from', 'some', '.', 'being', 'off', 'pm', "why's", 'is', 'cm x', 'where', 'whom', 'employee', 'on', 'ore', 'therefore', 'her', 'you', 'other', 'i', "won't", 'could', 'between', 'should', 'nor', 'else', 'r', 'so', 'its', "we're", 'ever', "i'm", 'myself', 'herself', 'http', 'theirs', "hadn't", "you'd", 'above', 'was', 'than', 'did', 'ourselves', 'this', 'any', 'however', 'has', "isn't", 'ought', 'your', 'left', 'him', 'for', 'more', 'once', 'as', "she'd", "he'll", "i've", 'up', 'also', 'them'}
In [43]:
#function for text preprocessing

def text_preprocessing(row):
    sentence = row.Description
    #convert all characters to lowercase
    lowered = sentence.lower()

     #remove alphanumeric charcater
    pattern=r'[^a-zA-Z\s]'
    formatted=re.sub(pattern, ' ', lowered)

    formatted = re.sub(r'\b[a-zA-Z]\b', '', formatted)

    #  removing white space
    formatted = formatted.strip()

    tokens = tokenize.word_tokenize(formatted)
    tokens = [token for token in tokens if len(token)>2]

    #lemmatizing & stemming
    lemmatizer = stem.WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(i) for i in tokens if i not in STOPWORDS]


    return " ".join(lemmatized)
In [44]:
#preprocessing function applied to description column

encoded_df_ish['Description_processed'] = encoded_df_ish.apply(text_preprocessing, axis=1)
In [45]:
#description visulaization before and after preprocessing

encoded_df_ish[['Description','Description_processed']].head(5)
Out[45]:
Description Description_processed
0 While removing the drill rod of the Jumbo 08 f... removing drill rod jumbo maintenance superviso...
1 During the activation of a sodium sulphide pum... activation sodium sulphide pump piping uncoupl...
2 In the sub-station MILPO located at level +170... sub station milpo located level collaborator e...
3 Being 9:45 am. approximately in the Nv. 1880 C... approximately personnel begin task unlocking s...
4 Approximately at 11:45 a.m. in circumstances t... approximately circumstance mechanic anthony gr...

Text processing Analysis¶

Word Cloud Analysis¶

In [46]:
wordcloud = WordCloud(width = 1500, height = 800, random_state=0, background_color='black', colormap='rainbow', \
                      min_font_size=5, max_words=300, collocations=False, min_word_length=3, stopwords = STOPWORDS).generate(" ".join(encoded_df_ish['Description_processed'].values))
plt.figure(figsize=(15,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
No description has been provided for this image

unigram, Bi-Gram & tri-gram Presentation¶

In [47]:
def ngram_func(ngram, trg='', trg_value=''):
    #trg_value is list-object
    if (trg == '') or (trg_value == ''):
        string_filterd =  encoded_df_ish['Description_processed'].sum().split()
    else:
        string_filterd =  encoded_df_ish[Desc_df_ish[trg].isin(trg_value)]['Description_processed'].sum().split()
    dic = nltk.FreqDist(nltk.ngrams(string_filterd, ngram)).most_common(30)
    ngram_df = pd.DataFrame(dic, columns=['ngram','count'])
    ngram_df.index = [' '.join(i) for i in ngram_df.ngram]
    ngram_df.drop('ngram',axis=1, inplace=True)
    return ngram_df
In [48]:
def create_ngram_chart(ngram_df, title):
  plt.figure(figsize=(10, 6))
  sns.barplot(y=ngram_df.index, x='count', data=ngram_df)
  plt.title(title)
  plt.show()

# Unigrams
unigram_df = ngram_func(1)
create_ngram_chart(unigram_df, 'Top 30 Unigrams')

# Bigrams
bigram_df = ngram_func(2)
create_ngram_chart(bigram_df, 'Top 30 Bigrams')

# Trigrams
trigram_df = ngram_func(3)
create_ngram_chart(trigram_df, 'Top 30 Trigrams')
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Text embedding¶

In [49]:
#defining y variable

y=encoded_df_ish['Accident_Level_groupped']

GLOVE Embedding¶

In [50]:
#defining glove model
glove_input_file='/content/drive/MyDrive/Capstone/glove.6B.100d.txt'
glove_output_file= '/content/drive/MyDrive/Capstone/glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file,glove_output_file)
model_glove = keyedvectors.load_word2vec_format(glove_output_file, binary = False)
with open('/content/drive/MyDrive/Capstone/glove_model.pkl', 'wb') as file:
     pickle.dump(model_glove, file)
In [51]:
with open('/content/drive/MyDrive/Capstone/glove_model.pkl', 'rb') as file:
    model_glove = pickle.load(file)
In [52]:
#creating glove model dictionary


print(f'Length of Glove vocab is : {len(model_glove.index_to_key)}')

glove_words = model_glove.index_to_key
glove_words_vectors=model_glove.vectors

glove_dictionary = dict(zip(glove_words, glove_words_vectors))
Length of Glove vocab is : 400000
In [53]:
#function to average the vector for glove model

vec_size=100

def average_vectorization_glove(text):
    feature_vector= np.zeros((vec_size,), dtype='float64')

    words_in_vocab= [word for word in text.split() if word in glove_words]

    for word in words_in_vocab:
        feature_vector+=np.array(glove_dictionary[word])

    if words_in_vocab != 0:
        feature_vector/=len(words_in_vocab)

    return feature_vector
In [54]:
#creating test train split on glove embedding data as independent variable
np.random.seed(42)

X_glove_embedded=np.array(encoded_df_ish['Description_processed'].apply(average_vectorization_glove).tolist())

X_glove_embedded_df =pd.DataFrame(X_glove_embedded)

X_glove_train, X_glove_test, y_glove_train, y_glove_test = train_test_split(X_glove_embedded_df, y, test_size=0.3, random_state=42)

TF-IDF Vectorization¶

In [55]:
#TF-IDF Vectorization
np.random.seed(42)
tfidf = TfidfVectorizer(max_features=3000, ngram_range=(1,3), stop_words='english')
X_tfidf_embedding = tfidf.fit_transform(encoded_df_ish['Description_processed']).toarray()

svd = TruncatedSVD(n_components=100, random_state=42)
X_tfidf_train_svd_selected = svd.fit_transform(X_tfidf_embedding)

X_tfidf_train_svd_selected_df =pd.DataFrame(X_tfidf_train_svd_selected)


#creating test train split and data on TF-IDF embedding data as independent variable
np.random.seed(42)

X_tfidf_train, X_tfidf_test, y_tfidf_train, y_tfidf_test = train_test_split(
    X_tfidf_train_svd_selected_df, y, test_size=0.3,random_state=42)

Utility functions for evaluating model performance¶

In [56]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
import matplotlib.pyplot as plt

columns= ['Classifier', 'Train Accuracy', 'Test Accuracy','Train F1 Score(avg)', 'Test F1 Score(avg)','Train F1 Score(macro)', 'Test F1 Score(macro)','Train Recall', 'Test Recall','Train Precision (avg)' ,'Test Precision (avg)', 'Test Time','cv score']
result_df_final = pd.DataFrame(columns=columns)

def confusion_metrics(model, model_name, embedding_matrix, y_train, y_pred_train, y_test, y_pred):
    test_time = None
    cv_score = None

    cm = confusion_matrix(y_test, y_pred)
    display = ConfusionMatrixDisplay(confusion_matrix=cm)
    display.plot(cmap='Blues')

    acc_test = accuracy_score(y_test, y_pred)
    precision_test_macro = precision_score(y_test, y_pred, average='macro')
    precision_test_avg = precision_score(y_test, y_pred, average='weighted')

    recall_test = recall_score(y_test, y_pred, average='weighted')
    f1_test = f1_score(y_test, y_pred, average='weighted')
    f1_test_macro=f1_score(y_test, y_pred, average='macro')
    acc_train = accuracy_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train, average='weighted')
    f1_train_macro = f1_score(y_train, y_pred_train, average='macro')

    recall_train = recall_score(y_train, y_pred_train, average='weighted')
    precision_train_avg = precision_score(y_train, y_pred_train, average='weighted')
    precision_train_macro = precision_score(y_train, y_pred_train, average='macro')


    print(f'Accuracy : {acc_test}')
    print(f'Precision (macro) : {precision_test_macro}')
    print(f'Precision (avg): {precision_test_avg}')
    print(f'Recall : {recall_test}')
    print(f'F1-Score : {f1_test}')
    plt.grid(True)
    plt.show()

    full_model_name = f'{model_name} with {embedding_matrix}'

    global result_df_final
    new_row = pd.DataFrame([[
    full_model_name,
    acc_train, acc_test,
    f1_train, f1_test,
    f1_train_macro, f1_test_macro,
    recall_train, recall_test,
    precision_train_avg, precision_test_avg,
    test_time if test_time else '-',
    cv_score if cv_score else '-'
    ]], columns=result_df_final.columns)

    result_df_final = pd.concat([result_df_final, new_row], ignore_index=True)


    return result_df_final

ML Model Building & Performance Analysis¶


ML Model fitting function declaration¶

In [57]:
#function to run the model with CV and its hyperparameter tuning and check its  confusion matrix

def modelFitting(model, model_name, embedding_matrix, X_train, y_train, X_test, y_test, isGridSearchNeeded, param_grid):
    start = time.time()
    result = []
    score_mean = None

    # Ensure inputs are NumPy arrays
    X_train = np.asarray(X_train)
    X_test = np.asarray(X_test)
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)

    # Convert to DataFrame for feature selection (if needed)
    X_train_df = pd.DataFrame(X_train)
    X_test_df = pd.DataFrame(X_test)

    cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

    # Fit initial model to compute feature importance
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_train = model.predict(X_train)

    # Feature importance handling
    try:
        importances = pd.Series(model.feature_importances_, index=X_train_df.columns)
    except AttributeError:
        if isinstance(model, VotingClassifier):
            try:
                rf_importances = model.named_estimators_['rf'].feature_importances_
                xgb_importances = model.named_estimators_['xgb'].feature_importances_
                importances = pd.Series((rf_importances + xgb_importances) / 2, index=X_train_df.columns)
            except Exception as e:
                print(f"Could not compute ensemble feature importances: {e}")
                importances = pd.Series(np.zeros(X_train_df.shape[1]), index=X_train_df.columns)
        else:
            print(f"{model_name} does not support feature_importances_")
            importances = pd.Series(np.zeros(X_train_df.shape[1]), index=X_train_df.columns)

    # Select least important features dynamically (e.g., bottom 10%)
    n_features_to_drop = int(0.1 * X_train_df.shape[1])  # Drop bottom 10% of features
    least_important_features = importances.sort_values(ascending=True).head(n_features_to_drop).index.tolist()

    X_train_reduced = X_train_df.drop(columns=least_important_features, errors='ignore')
    X_test_reduced = X_test_df.drop(columns=least_important_features, errors='ignore')

    # Clone model for fresh training
    model = clone(model)

    if isGridSearchNeeded:
        search = RandomizedSearchCV(model, param_grid, cv=cv, n_jobs=-1, verbose=2, scoring='accuracy', n_iter=20, random_state=42)
        search.fit(X_train_reduced, y_train)
        model = search.best_estimator_
        y_pred = search.predict(X_test_reduced)
        y_pred_train = search.predict(X_train_reduced)
        score_mean = search.best_score_
        print(f'Grid Search Best Estimator: {search.best_estimator_}')
        print(f'Grid Search Best Parameters: {search.best_params_}')
        print(f'Grid Search Best Score: {score_mean}')
    else:
        scores = cross_val_score(model, X_train_reduced, y_train, cv=cv, scoring='accuracy')
        score_mean = scores.mean()
        print("Cross-validation scores:", scores)
        print("Mean accuracy:", score_mean)
        model.fit(X_train_reduced, y_train)
        y_pred = model.predict(X_test_reduced)
        y_pred_train = model.predict(X_train_reduced)

    # Handle predictions (avoid np.argmax for non-probability outputs)
    if hasattr(model, 'predict_proba'):
        y_pred_train = np.argmax(model.predict_proba(X_train_reduced), axis=1) if y_pred_train.ndim > 1 else y_pred_train
        y_pred = np.argmax(model.predict_proba(X_test_reduced), axis=1) if y_pred.ndim > 1 else y_pred
    else:
        y_pred_train = y_pred_train.flatten()
        y_pred = y_pred.flatten()

    print(f"\n=== {model_name} - Test Set ===")
    confusion_metrics(model, model_name, embedding_matrix, y_train, y_pred_train, y_test, y_pred)

    # Training metrics
    training_time = time.time() - start
    acc_train = accuracy_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train, average='weighted')
    recall_train = recall_score(y_train, y_pred_train, average='weighted')
    precision_train = precision_score(y_train, y_pred_train, average='weighted')

    # Test metrics
    start = time.time()
    y_pred = model.predict(X_test_reduced)  # Recompute predictions for accurate timing
    test_time = time.time() - start
    acc_test = accuracy_score(y_test, y_pred)
    f1_test = f1_score(y_test, y_pred, average='weighted')
    recall_test = recall_score(y_test, y_pred, average='weighted')
    precision_test = precision_score(y_test, y_pred, average='weighted')

    result.append([model_name, acc_train, f1_train, recall_train, precision_train, training_time,
                   acc_test, f1_test, recall_test, precision_test, test_time, score_mean])
    return result


# function to add the result in a dataframe
def resultToDf(result):
    columns= ['Classifier', 'Train Accuracy', 'Train F1 Score', 'Train Recall', 'Train Precision', 'Training Time', 'Test Accuracy', 'Test F1 Score', 'Test Recall', 'Test Precision', 'Test Time','cv score']
    return pd.DataFrame(result, columns=columns)

#model to check te performance and return the input dataframe and its result
def checkModelPerformance(embedding_matrix, isGridSearchNeeded, paramGrid):
  results=[]
  if embedding_matrix == 'glove':
    X_train, X_test, y_train, y_test = X_glove_train, X_glove_test, y_glove_train, y_glove_test

  else:
    X_train, X_test, y_train, y_test = X_tfidf_train, X_tfidf_test, y_tfidf_train, y_tfidf_test


  df =pd.DataFrame(X_train)
  print(f' Dataframe: \n{df.head(5)}')

  for model_name, model in models.items():
        print(f"\n---------------- {model_name} ----------------")

        model = clone(model)
        param_grid = paramGrids.get(model_name, {})  # get param grid for this model, or empty if none
        model_result = modelFitting(model, model_name, embedding_matrix,X_train, y_train, X_test, y_test, isGridSearchNeeded, param_grid)
        results.extend(model_result)
  results_df= resultToDf(results)
  return df,  results_df

XG Boost and Random Forest¶

In [58]:
#Model Building

model_xgboost = xgb.XGBClassifier(class_weight='balanced', random_state=42, subsample=0.8, eval_metric='mlogloss')
model_rf= RandomForestClassifier(class_weight='balanced', bootstrap=True)

Voting Classifier models¶

In [59]:
#Voting classifier on random forest and XG boost

ensemble_model = VotingClassifier(
    estimators=[
        ('rf', model_rf),
        ('xgb', model_xgboost)
    ],
    voting='soft'
)

#Voting classifier with weight on random forest and XG boost


ensemble_model_w = VotingClassifier(
    estimators=[
        ('rf', model_rf),
        ('xgb', model_xgboost)
    ],
    voting='soft',
    weights=[3,2]  )

Param declaration for Hyper parameter tuning¶

In [60]:
#defining models and their hyperparmeter in JSON

models={
    'Random Forest':model_rf,
    'xgboost':model_xgboost,
    'ensemble Model' : ensemble_model,
    'ensemble Model w' : ensemble_model_w,
}

paramGrids = {
    'xgboost': {
        'n_estimators': [250, 230, 270],
        'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [5, 4, 6],
    'learning_rate': [0.01, 0.08, 0.1],
    'scale_pos_weight':['auto_weight_for_imbalance'],
    'class_weight': ['balanced']

    },
    'Random Forest': {
        'n_estimators': [420, 400,370],
    'max_depth': [20,40,50],
    'min_samples_split': [2, 3],
        'min_samples_leaf':[3,1,5],
    'class_weight': ['balanced'],
    'bootstrap': [True],
        'max_features': [ 'sqrt', 'log2']

        },
    'ensemble Model':{
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [None, 10, 20],
    'xgb__n_estimators': [100, 200],
    'xgb__max_depth': [3, 5, 7],
    'xgb__learning_rate': [0.01, 0.1],
},
   'ensemble Model w': {
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [None, 10],
    'xgb__n_estimators': [100, 200],
    'xgb__max_depth': [3, 5],
    'xgb__learning_rate': [0.05, 0.1],
    'weights': [(1, 1), (2, 1), (1, 2), (3, 2)]
}
}

ML Models testing & performance exploration¶

Different ML models using glove embedding¶

In [61]:
# Testing glove embedding model performance with all the defined classifiers

glove_df,result_glove_df = checkModelPerformance('glove', False, {})
result_glove_df
 Dataframe: 
           0         1         2         3         4         5         6   \
118 -0.056184  0.177656  0.174576  0.045556 -0.193893 -0.088288 -0.105643   
271 -0.179080  0.114483  0.295197 -0.303266 -0.062244  0.130315 -0.194258   
114 -0.056861  0.334826 -0.014339 -0.029970 -0.211213 -0.038787 -0.245522   
311  0.021858  0.124828  0.058508 -0.061422  0.058099 -0.106459  0.002638   
131 -0.178758  0.053583  0.137200 -0.357772  0.165735  0.213804 -0.135574   

           7         8         9   ...        90        91        92  \
118  0.334605  0.019906  0.123581  ...  0.133245 -0.126663 -0.087192   
271  0.097784 -0.012458  0.107867  ... -0.326409 -0.127338 -0.050805   
114  0.246878 -0.142466  0.134186  ... -0.193840 -0.045271 -0.131900   
311  0.028145 -0.030354  0.155111  ... -0.213158  0.313407 -0.147785   
131  0.127943  0.127478  0.271656  ...  0.074759 -0.169716 -0.151178   

           93        94        95        96        97        98        99  
118  0.125817 -0.230977  0.184157 -0.034976 -0.188600  0.401641 -0.243753  
271 -0.031106 -0.247197  0.095136  0.041451  0.121288  0.379360 -0.059034  
114  0.101684 -0.040527 -0.120222 -0.063292 -0.240822  0.581549  0.078819  
311 -0.031894  0.174653  0.027815 -0.079510 -0.195726  0.454219 -0.001877  
131 -0.175557 -0.140331 -0.160881  0.158265  0.047664  0.363717 -0.020860  

[5 rows x 100 columns]

---------------- Random Forest ----------------
Cross-validation scores: [0.83050847 0.79661017 0.84482759 0.84482759 0.82758621]
Mean accuracy: 0.8288720046756284

=== Random Forest - Test Set ===
Accuracy : 0.8333333333333334
Precision (macro) : 0.2777777777777778
Precision (avg): 0.6944444444444444
Recall : 0.8333333333333334
F1-Score : 0.7575757575757576
No description has been provided for this image
---------------- xgboost ----------------
Cross-validation scores: [0.83050847 0.79661017 0.84482759 0.87931034 0.82758621]
Mean accuracy: 0.8357685563997663

=== xgboost - Test Set ===
Accuracy : 0.8571428571428571
Precision (macro) : 0.5128205128205128
Precision (avg): 0.827024827024827
Recall : 0.8571428571428571
F1-Score : 0.8303916161059018
No description has been provided for this image
---------------- ensemble Model ----------------
Cross-validation scores: [0.83050847 0.79661017 0.84482759 0.86206897 0.82758621]
Mean accuracy: 0.8323202805376972

=== ensemble Model - Test Set ===
Accuracy : 0.8492063492063492
Precision (macro) : 0.5341530054644809
Precision (avg): 0.8234777517564402
Recall : 0.8492063492063492
F1-Score : 0.8029204403462209
No description has been provided for this image
---------------- ensemble Model w ----------------
Cross-validation scores: [0.83050847 0.79661017 0.84482759 0.84482759 0.82758621]
Mean accuracy: 0.8288720046756284

=== ensemble Model w - Test Set ===
Accuracy : 0.8492063492063492
Precision (macro) : 0.6155913978494624
Precision (avg): 0.8564388120839734
Recall : 0.8492063492063492
F1-Score : 0.7929147398891628
No description has been provided for this image
Out[61]:
Classifier Train Accuracy Train F1 Score Train Recall Train Precision Training Time Test Accuracy Test F1 Score Test Recall Test Precision Test Time cv score
0 Random Forest 0.996575 0.996592 0.996575 0.996655 1.751820 0.833333 0.757576 0.833333 0.694444 0.011678 0.828872
1 xgboost 0.996575 0.996558 0.996575 0.996589 4.207625 0.857143 0.830392 0.857143 0.827025 0.019301 0.835769
2 ensemble Model 0.996575 0.996592 0.996575 0.996655 7.796061 0.849206 0.802920 0.849206 0.823478 0.026343 0.832320
3 ensemble Model w 0.996575 0.996592 0.996575 0.996655 5.233331 0.849206 0.792915 0.849206 0.856439 0.027977 0.828872

Different ML models using TF-IDF embedding¶

In [62]:
# Testing tfidf embedding model performance with all the defined classifiers

tfidf_df, result_tfidf_df= checkModelPerformance('tfidf', False, {})
result_tfidf_df
 Dataframe: 
           0         1         2         3         4         5         6   \
118  0.188436  0.099968 -0.076000 -0.015923 -0.099695 -0.047942 -0.071160   
271  0.091592 -0.116181 -0.021995  0.034526  0.004983  0.110230 -0.049610   
114  0.134943  0.066018 -0.004363 -0.013355  0.027673 -0.040776 -0.026864   
311  0.183889 -0.184939  0.110300 -0.070630  0.142109  0.025369  0.028994   
131  0.101048 -0.187166  0.026439 -0.004623  0.049992  0.124609 -0.044363   

           7         8         9   ...        90        91        92  \
118 -0.073575 -0.076155 -0.052180  ...  0.077940  0.066760 -0.049240   
271  0.043311 -0.027260 -0.100401  ... -0.026688 -0.039597 -0.043832   
114 -0.069484 -0.002781 -0.037861  ...  0.050448  0.017158  0.152878   
311  0.010995  0.045685 -0.092659  ... -0.073875  0.025463  0.033079   
131  0.005140 -0.076758  0.090388  ... -0.151817 -0.069046  0.061397   

           93        94        95        96        97        98        99  
118  0.051502  0.029736 -0.045489  0.076022  0.087375  0.050234 -0.064669  
271  0.101309 -0.074299  0.011897 -0.135090 -0.095252 -0.015050  0.013355  
114  0.025227  0.021670 -0.002196 -0.104157  0.060381 -0.001387  0.067561  
311  0.007076  0.015927 -0.000240  0.011134  0.118085  0.059991 -0.088090  
131  0.111561  0.003630  0.011091 -0.043785  0.036522 -0.076769  0.173746  

[5 rows x 100 columns]

---------------- Random Forest ----------------
Cross-validation scores: [0.83050847 0.79661017 0.84482759 0.84482759 0.82758621]
Mean accuracy: 0.8288720046756284

=== Random Forest - Test Set ===
Accuracy : 0.8333333333333334
Precision (macro) : 0.2777777777777778
Precision (avg): 0.6944444444444444
Recall : 0.8333333333333334
F1-Score : 0.7575757575757576
No description has been provided for this image
---------------- xgboost ----------------
Cross-validation scores: [0.84745763 0.76271186 0.82758621 0.84482759 0.81034483]
Mean accuracy: 0.8185856224430157

=== xgboost - Test Set ===
Accuracy : 0.8333333333333334
Precision (macro) : 0.4480874316939891
Precision (avg): 0.7789487379651314
Recall : 0.8333333333333334
F1-Score : 0.7824658050668393
No description has been provided for this image
---------------- ensemble Model ----------------
Cross-validation scores: [0.83050847 0.79661017 0.84482759 0.84482759 0.82758621]
Mean accuracy: 0.8288720046756284

=== ensemble Model - Test Set ===
Accuracy : 0.8412698412698413
Precision (macro) : 0.6133333333333333
Precision (avg): 0.8507936507936508
Recall : 0.8412698412698413
F1-Score : 0.7759489302967564
No description has been provided for this image
---------------- ensemble Model w ----------------
Cross-validation scores: [0.83050847 0.79661017 0.84482759 0.84482759 0.82758621]
Mean accuracy: 0.8288720046756284

=== ensemble Model w - Test Set ===
Accuracy : 0.8333333333333334
Precision (macro) : 0.2777777777777778
Precision (avg): 0.6944444444444444
Recall : 0.8333333333333334
F1-Score : 0.7575757575757576
No description has been provided for this image
Out[62]:
Classifier Train Accuracy Train F1 Score Train Recall Train Precision Training Time Test Accuracy Test F1 Score Test Recall Test Precision Test Time cv score
0 Random Forest 0.993151 0.993216 0.993151 0.993462 1.807230 0.833333 0.757576 0.833333 0.694444 0.008217 0.828872
1 xgboost 0.993151 0.993081 0.993151 0.993206 5.023685 0.833333 0.782466 0.833333 0.778949 0.018924 0.818586
2 ensemble Model 0.993151 0.993216 0.993151 0.993462 5.301445 0.841270 0.775949 0.841270 0.850794 0.028287 0.828872
3 ensemble Model w 0.993151 0.993216 0.993151 0.993462 6.423100 0.833333 0.757576 0.833333 0.694444 0.074835 0.828872

Different hyperparameter tuned ML models using glove embedding¶

In [63]:
# Testing glove embedding model performance with all the defined classifiers using hyperparameter tuning

df_glove,result_glove_grid_df = checkModelPerformance('glove', True, paramGrids)
result_glove_grid_df
 Dataframe: 
           0         1         2         3         4         5         6   \
118 -0.056184  0.177656  0.174576  0.045556 -0.193893 -0.088288 -0.105643   
271 -0.179080  0.114483  0.295197 -0.303266 -0.062244  0.130315 -0.194258   
114 -0.056861  0.334826 -0.014339 -0.029970 -0.211213 -0.038787 -0.245522   
311  0.021858  0.124828  0.058508 -0.061422  0.058099 -0.106459  0.002638   
131 -0.178758  0.053583  0.137200 -0.357772  0.165735  0.213804 -0.135574   

           7         8         9   ...        90        91        92  \
118  0.334605  0.019906  0.123581  ...  0.133245 -0.126663 -0.087192   
271  0.097784 -0.012458  0.107867  ... -0.326409 -0.127338 -0.050805   
114  0.246878 -0.142466  0.134186  ... -0.193840 -0.045271 -0.131900   
311  0.028145 -0.030354  0.155111  ... -0.213158  0.313407 -0.147785   
131  0.127943  0.127478  0.271656  ...  0.074759 -0.169716 -0.151178   

           93        94        95        96        97        98        99  
118  0.125817 -0.230977  0.184157 -0.034976 -0.188600  0.401641 -0.243753  
271 -0.031106 -0.247197  0.095136  0.041451  0.121288  0.379360 -0.059034  
114  0.101684 -0.040527 -0.120222 -0.063292 -0.240822  0.581549  0.078819  
311 -0.031894  0.174653  0.027815 -0.079510 -0.195726  0.454219 -0.001877  
131 -0.175557 -0.140331 -0.160881  0.158265  0.047664  0.363717 -0.020860  

[5 rows x 100 columns]

---------------- Random Forest ----------------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Grid Search Best Estimator: RandomForestClassifier(class_weight='balanced', max_depth=40,
                       max_features='log2', min_samples_leaf=5,
                       min_samples_split=3, n_estimators=420)
Grid Search Best Parameters: {'n_estimators': 420, 'min_samples_split': 3, 'min_samples_leaf': 5, 'max_features': 'log2', 'max_depth': 40, 'class_weight': 'balanced', 'bootstrap': True}
Grid Search Best Score: 0.8322618351841029

=== Random Forest - Test Set ===
Accuracy : 0.8412698412698413
Precision (macro) : 0.6133333333333333
Precision (avg): 0.8507936507936508
Recall : 0.8412698412698413
F1-Score : 0.7759489302967564
No description has been provided for this image
---------------- xgboost ----------------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Grid Search Best Estimator: XGBClassifier(base_score=None, booster=None, callbacks=None,
              class_weight='balanced', colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=0.6, device=None,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='mlogloss', feature_types=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=4, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=230, n_jobs=None,
              num_parallel_tree=None, ...)
Grid Search Best Parameters: {'subsample': 0.6, 'scale_pos_weight': 'auto_weight_for_imbalance', 'n_estimators': 230, 'max_depth': 4, 'learning_rate': 0.1, 'colsample_bytree': 0.6, 'class_weight': 'balanced'}
Grid Search Best Score: 0.8356516656925773

=== xgboost - Test Set ===
Accuracy : 0.8412698412698413
Precision (macro) : 0.5040650406504065
Precision (avg): 0.8051361465995612
Recall : 0.8412698412698413
F1-Score : 0.7876509455456824
No description has been provided for this image
---------------- ensemble Model ----------------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Grid Search Best Estimator: VotingClassifier(estimators=[('rf',
                              RandomForestClassifier(class_weight='balanced',
                                                     max_depth=20)),
                             ('xgb',
                              XGBClassifier(base_score=None, booster=None,
                                            callbacks=None,
                                            class_weight='balanced',
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None, device=None,
                                            early_stopping_rounds=None,
                                            enable_categorical=False,
                                            eval_metric='mlogloss',
                                            feature_ty...None,
                                            grow_policy=None,
                                            importance_type=None,
                                            interaction_constraints=None,
                                            learning_rate=0.1, max_bin=None,
                                            max_cat_threshold=None,
                                            max_cat_to_onehot=None,
                                            max_delta_step=None, max_depth=3,
                                            max_leaves=None,
                                            min_child_weight=None, missing=nan,
                                            monotone_constraints=None,
                                            multi_strategy=None,
                                            n_estimators=100, n_jobs=None,
                                            num_parallel_tree=None, ...))],
                 voting='soft')
Grid Search Best Parameters: {'xgb__n_estimators': 100, 'xgb__max_depth': 3, 'xgb__learning_rate': 0.1, 'rf__n_estimators': 100, 'rf__max_depth': 20}
Grid Search Best Score: 0.8356516656925775

=== ensemble Model - Test Set ===
Accuracy : 0.8492063492063492
Precision (macro) : 0.6155913978494624
Precision (avg): 0.8564388120839734
Recall : 0.8492063492063492
F1-Score : 0.7929147398891628
No description has been provided for this image
---------------- ensemble Model w ----------------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Grid Search Best Estimator: VotingClassifier(estimators=[('rf',
                              RandomForestClassifier(class_weight='balanced',
                                                     n_estimators=200)),
                             ('xgb',
                              XGBClassifier(base_score=None, booster=None,
                                            callbacks=None,
                                            class_weight='balanced',
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None, device=None,
                                            early_stopping_rounds=None,
                                            enable_categorical=False,
                                            eval_metric='mlogloss',
                                            featur...
                                            grow_policy=None,
                                            importance_type=None,
                                            interaction_constraints=None,
                                            learning_rate=0.1, max_bin=None,
                                            max_cat_threshold=None,
                                            max_cat_to_onehot=None,
                                            max_delta_step=None, max_depth=5,
                                            max_leaves=None,
                                            min_child_weight=None, missing=nan,
                                            monotone_constraints=None,
                                            multi_strategy=None,
                                            n_estimators=200, n_jobs=None,
                                            num_parallel_tree=None, ...))],
                 voting='soft', weights=(1, 2))
Grid Search Best Parameters: {'xgb__n_estimators': 200, 'xgb__max_depth': 5, 'xgb__learning_rate': 0.1, 'weights': (1, 2), 'rf__n_estimators': 200, 'rf__max_depth': None}
Grid Search Best Score: 0.8357101110461718

=== ensemble Model w - Test Set ===
Accuracy : 0.8492063492063492
Precision (macro) : 0.6155913978494624
Precision (avg): 0.8564388120839734
Recall : 0.8492063492063492
F1-Score : 0.7929147398891628
No description has been provided for this image
Out[63]:
Classifier Train Accuracy Train F1 Score Train Recall Train Precision Training Time Test Accuracy Test F1 Score Test Recall Test Precision Test Time cv score
0 Random Forest 0.996575 0.996592 0.996575 0.996655 72.266037 0.841270 0.775949 0.841270 0.850794 0.034887 0.832262
1 xgboost 0.996575 0.996558 0.996575 0.996589 161.761228 0.841270 0.787651 0.841270 0.805136 0.018924 0.835652
2 ensemble Model 0.996575 0.996592 0.996575 0.996655 156.855411 0.849206 0.792915 0.849206 0.856439 0.024351 0.835652
3 ensemble Model w 0.996575 0.996592 0.996575 0.996655 128.998138 0.849206 0.792915 0.849206 0.856439 0.036261 0.835710

Different hyperparameter tuned ML models using TF-IDF embedding¶

In [64]:
# Testing tfidf embedding model performance with all the defined classifiers using hyperparameter tuning

tfidf_df,result_grid_tfidf_df = checkModelPerformance('tfidf', True, paramGrids)
result_grid_tfidf_df
 Dataframe: 
           0         1         2         3         4         5         6   \
118  0.188436  0.099968 -0.076000 -0.015923 -0.099695 -0.047942 -0.071160   
271  0.091592 -0.116181 -0.021995  0.034526  0.004983  0.110230 -0.049610   
114  0.134943  0.066018 -0.004363 -0.013355  0.027673 -0.040776 -0.026864   
311  0.183889 -0.184939  0.110300 -0.070630  0.142109  0.025369  0.028994   
131  0.101048 -0.187166  0.026439 -0.004623  0.049992  0.124609 -0.044363   

           7         8         9   ...        90        91        92  \
118 -0.073575 -0.076155 -0.052180  ...  0.077940  0.066760 -0.049240   
271  0.043311 -0.027260 -0.100401  ... -0.026688 -0.039597 -0.043832   
114 -0.069484 -0.002781 -0.037861  ...  0.050448  0.017158  0.152878   
311  0.010995  0.045685 -0.092659  ... -0.073875  0.025463  0.033079   
131  0.005140 -0.076758  0.090388  ... -0.151817 -0.069046  0.061397   

           93        94        95        96        97        98        99  
118  0.051502  0.029736 -0.045489  0.076022  0.087375  0.050234 -0.064669  
271  0.101309 -0.074299  0.011897 -0.135090 -0.095252 -0.015050  0.013355  
114  0.025227  0.021670 -0.002196 -0.104157  0.060381 -0.001387  0.067561  
311  0.007076  0.015927 -0.000240  0.011134  0.118085  0.059991 -0.088090  
131  0.111561  0.003630  0.011091 -0.043785  0.036522 -0.076769  0.173746  

[5 rows x 100 columns]

---------------- Random Forest ----------------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Grid Search Best Estimator: RandomForestClassifier(class_weight='balanced', max_depth=50,
                       min_samples_leaf=3, min_samples_split=3,
                       n_estimators=370)
Grid Search Best Parameters: {'n_estimators': 370, 'min_samples_split': 3, 'min_samples_leaf': 3, 'max_features': 'sqrt', 'max_depth': 50, 'class_weight': 'balanced', 'bootstrap': True}
Grid Search Best Score: 0.8288720046756284

=== Random Forest - Test Set ===
Accuracy : 0.8333333333333334
Precision (macro) : 0.2777777777777778
Precision (avg): 0.6944444444444444
Recall : 0.8333333333333334
F1-Score : 0.7575757575757576
No description has been provided for this image
---------------- xgboost ----------------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Grid Search Best Estimator: XGBClassifier(base_score=None, booster=None, callbacks=None,
              class_weight='balanced', colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=0.6, device=None,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='mlogloss', feature_types=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=6, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=270, n_jobs=None,
              num_parallel_tree=None, ...)
Grid Search Best Parameters: {'subsample': 0.6, 'scale_pos_weight': 'auto_weight_for_imbalance', 'n_estimators': 270, 'max_depth': 6, 'learning_rate': 0.01, 'colsample_bytree': 0.6, 'class_weight': 'balanced'}
Grid Search Best Score: 0.8288720046756284

=== xgboost - Test Set ===
Accuracy : 0.8333333333333334
Precision (macro) : 0.2777777777777778
Precision (avg): 0.6944444444444444
Recall : 0.8333333333333334
F1-Score : 0.7575757575757576
No description has been provided for this image
---------------- ensemble Model ----------------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Grid Search Best Estimator: VotingClassifier(estimators=[('rf',
                              RandomForestClassifier(class_weight='balanced')),
                             ('xgb',
                              XGBClassifier(base_score=None, booster=None,
                                            callbacks=None,
                                            class_weight='balanced',
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None, device=None,
                                            early_stopping_rounds=None,
                                            enable_categorical=False,
                                            eval_metric='mlogloss',
                                            feature_types=None, gamma=None,
                                            grow_policy=None,
                                            importance_type=None,
                                            interaction_constraints=None,
                                            learning_rate=0.01, max_bin=None,
                                            max_cat_threshold=None,
                                            max_cat_to_onehot=None,
                                            max_delta_step=None, max_depth=7,
                                            max_leaves=None,
                                            min_child_weight=None, missing=nan,
                                            monotone_constraints=None,
                                            multi_strategy=None,
                                            n_estimators=100, n_jobs=None,
                                            num_parallel_tree=None, ...))],
                 voting='soft')
Grid Search Best Parameters: {'xgb__n_estimators': 100, 'xgb__max_depth': 7, 'xgb__learning_rate': 0.01, 'rf__n_estimators': 100, 'rf__max_depth': None}
Grid Search Best Score: 0.8288720046756284

=== ensemble Model - Test Set ===
Accuracy : 0.8333333333333334
Precision (macro) : 0.2777777777777778
Precision (avg): 0.6944444444444444
Recall : 0.8333333333333334
F1-Score : 0.7575757575757576
No description has been provided for this image
---------------- ensemble Model w ----------------
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Grid Search Best Estimator: VotingClassifier(estimators=[('rf',
                              RandomForestClassifier(class_weight='balanced',
                                                     n_estimators=200)),
                             ('xgb',
                              XGBClassifier(base_score=None, booster=None,
                                            callbacks=None,
                                            class_weight='balanced',
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None, device=None,
                                            early_stopping_rounds=None,
                                            enable_categorical=False,
                                            eval_metric='mlogloss',
                                            featur...
                                            grow_policy=None,
                                            importance_type=None,
                                            interaction_constraints=None,
                                            learning_rate=0.05, max_bin=None,
                                            max_cat_threshold=None,
                                            max_cat_to_onehot=None,
                                            max_delta_step=None, max_depth=3,
                                            max_leaves=None,
                                            min_child_weight=None, missing=nan,
                                            monotone_constraints=None,
                                            multi_strategy=None,
                                            n_estimators=100, n_jobs=None,
                                            num_parallel_tree=None, ...))],
                 voting='soft', weights=(2, 1))
Grid Search Best Parameters: {'xgb__n_estimators': 100, 'xgb__max_depth': 3, 'xgb__learning_rate': 0.05, 'weights': (2, 1), 'rf__n_estimators': 200, 'rf__max_depth': None}
Grid Search Best Score: 0.8288720046756284

=== ensemble Model w - Test Set ===
Accuracy : 0.8333333333333334
Precision (macro) : 0.2777777777777778
Precision (avg): 0.6944444444444444
Recall : 0.8333333333333334
F1-Score : 0.7575757575757576
No description has been provided for this image
Out[64]:
Classifier Train Accuracy Train F1 Score Train Recall Train Precision Training Time Test Accuracy Test F1 Score Test Recall Test Precision Test Time cv score
0 Random Forest 0.993151 0.993216 0.993151 0.993462 71.258850 0.833333 0.757576 0.833333 0.694444 0.031786 0.828872
1 xgboost 0.986301 0.985589 0.986301 0.986522 160.239827 0.833333 0.757576 0.833333 0.694444 0.020695 0.828872
2 ensemble Model 0.993151 0.993151 0.993151 0.993151 152.209462 0.833333 0.757576 0.833333 0.694444 0.034492 0.828872
3 ensemble Model w 0.993151 0.993216 0.993151 0.993462 124.776594 0.833333 0.757576 0.833333 0.694444 0.033019 0.828872

Neural Network¶


Data Preparation¶

Text summarization using facebook/bart-large-cnn¶

In [65]:
from tqdm import tqdm
tqdm.pandas()


fbsummarizer = pipeline("summarization", model="facebook/bart-large-cnn")
nlp = spacy.load("en_core_web_sm")


def extract_action_objects(text):

    doc = nlp(text)
    actions = []
    for token in doc:
        if token.pos_ in ["VERB", "NOUN", "ADJ"] and token.dep_ not in ("nsubj", "det", "poss"):
            actions.append(token.lemma_)
    return " ".join(actions)

def summarize_and_strip_subjects(text):
    prompt = "summarize: " + text
    summary = fbsummarizer(prompt, max_length=50, min_length=15, do_sample=False)[0]['summary_text']
    return extract_action_objects(summary)



encoded_df_ish['summary'] = encoded_df_ish['Description'].progress_apply(summarize_and_strip_subjects)



with open('/content/drive/MyDrive/Capstone/encoded_df_ish.pkl', 'wb') as file:
    pickle.dump(encoded_df_ish, file)
Device set to use cuda:0
  2%|▏         | 9/418 [00:07<05:06,  1.33it/s]Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
  3%|▎         | 11/418 [00:09<04:38,  1.46it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
  4%|▍         | 18/418 [00:13<04:34,  1.46it/s]Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
  7%|▋         | 29/418 [00:21<04:08,  1.56it/s]Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
  7%|▋         | 30/418 [00:22<04:16,  1.52it/s]Your max_length is set to 50, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)
  8%|▊         | 35/418 [00:26<04:24,  1.45it/s]Your max_length is set to 50, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)
 10%|▉         | 41/418 [00:30<04:44,  1.32it/s]Your max_length is set to 50, but your input_length is only 48. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 13%|█▎        | 55/418 [00:39<04:14,  1.43it/s]Your max_length is set to 50, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 14%|█▍        | 58/418 [00:41<04:44,  1.27it/s]Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 14%|█▍        | 59/418 [00:42<04:43,  1.27it/s]Your max_length is set to 50, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
 15%|█▌        | 64/418 [00:45<03:16,  1.81it/s]Your max_length is set to 50, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
 16%|█▌        | 65/418 [00:45<03:00,  1.96it/s]Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 16%|█▌        | 66/418 [00:46<02:56,  1.99it/s]Your max_length is set to 50, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 17%|█▋        | 72/418 [00:50<03:49,  1.51it/s]Your max_length is set to 50, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 17%|█▋        | 73/418 [00:50<03:27,  1.66it/s]Your max_length is set to 50, but your input_length is only 28. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)
 20%|██        | 84/418 [00:59<03:56,  1.41it/s]Your max_length is set to 50, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)
 21%|██▏       | 89/418 [01:01<03:20,  1.64it/s]Your max_length is set to 50, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 23%|██▎       | 96/418 [01:07<04:08,  1.30it/s]Your max_length is set to 50, but your input_length is only 43. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
 24%|██▎       | 99/418 [01:09<04:27,  1.19it/s]Your max_length is set to 50, but your input_length is only 42. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
 24%|██▍       | 100/418 [01:10<04:16,  1.24it/s]Your max_length is set to 50, but your input_length is only 27. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)
 25%|██▍       | 103/418 [01:12<03:38,  1.44it/s]Your max_length is set to 50, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
 26%|██▌       | 107/418 [01:14<02:36,  1.99it/s]Your max_length is set to 50, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
 27%|██▋       | 111/418 [01:15<02:23,  2.14it/s]Your max_length is set to 50, but your input_length is only 48. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 27%|██▋       | 112/418 [01:16<02:11,  2.33it/s]Your max_length is set to 50, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 27%|██▋       | 113/418 [01:16<02:03,  2.48it/s]Your max_length is set to 50, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 27%|██▋       | 114/418 [01:17<02:14,  2.26it/s]Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 29%|██▉       | 121/418 [01:22<03:56,  1.26it/s]Your max_length is set to 50, but your input_length is only 37. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
 29%|██▉       | 122/418 [01:23<03:32,  1.39it/s]Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 29%|██▉       | 123/418 [01:24<03:31,  1.40it/s]Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 32%|███▏      | 132/418 [01:29<03:13,  1.48it/s]Your max_length is set to 50, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 32%|███▏      | 133/418 [01:30<03:13,  1.47it/s]Your max_length is set to 50, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 35%|███▍      | 145/418 [01:39<03:16,  1.39it/s]Your max_length is set to 50, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
 38%|███▊      | 158/418 [01:47<03:31,  1.23it/s]Your max_length is set to 50, but your input_length is only 29. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)
 41%|████▏     | 173/418 [01:57<02:24,  1.70it/s]Your max_length is set to 50, but your input_length is only 45. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
 42%|████▏     | 176/418 [01:59<02:14,  1.80it/s]Your max_length is set to 50, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 43%|████▎     | 178/418 [02:00<02:26,  1.64it/s]Your max_length is set to 50, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)
 43%|████▎     | 181/418 [02:03<02:51,  1.38it/s]Your max_length is set to 50, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 44%|████▎     | 182/418 [02:03<02:26,  1.62it/s]Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 44%|████▍     | 183/418 [02:03<02:06,  1.86it/s]Your max_length is set to 50, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
 44%|████▍     | 186/418 [02:05<02:20,  1.65it/s]Your max_length is set to 50, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 45%|████▍     | 187/418 [02:05<02:02,  1.88it/s]Your max_length is set to 50, but your input_length is only 46. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 45%|████▍     | 188/418 [02:06<02:10,  1.77it/s]Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 46%|████▌     | 191/418 [02:08<02:22,  1.60it/s]Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 47%|████▋     | 195/418 [02:11<02:29,  1.50it/s]Your max_length is set to 50, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
 48%|████▊     | 199/418 [02:14<02:47,  1.30it/s]Your max_length is set to 50, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 48%|████▊     | 200/418 [02:15<03:02,  1.20it/s]Your max_length is set to 50, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 48%|████▊     | 201/418 [02:15<02:30,  1.45it/s]Your max_length is set to 50, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 49%|████▊     | 203/418 [02:16<02:14,  1.59it/s]Your max_length is set to 50, but your input_length is only 40. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
 50%|████▉     | 208/418 [02:19<02:13,  1.57it/s]Your max_length is set to 50, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
 50%|█████     | 209/418 [02:20<01:55,  1.82it/s]Your max_length is set to 50, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
 51%|█████     | 213/418 [02:22<02:12,  1.55it/s]Your max_length is set to 50, but your input_length is only 43. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
 52%|█████▏    | 218/418 [02:25<02:17,  1.46it/s]Your max_length is set to 50, but your input_length is only 48. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 53%|█████▎    | 221/418 [02:28<02:49,  1.16it/s]Your max_length is set to 50, but your input_length is only 37. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
 55%|█████▌    | 230/418 [02:34<01:46,  1.77it/s]Your max_length is set to 50, but your input_length is only 26. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)
 56%|█████▋    | 236/418 [02:38<01:59,  1.52it/s]Your max_length is set to 50, but your input_length is only 36. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
 58%|█████▊    | 241/418 [02:42<02:22,  1.25it/s]Your max_length is set to 50, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
 58%|█████▊    | 242/418 [02:42<02:05,  1.40it/s]Your max_length is set to 50, but your input_length is only 40. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
 60%|█████▉    | 250/418 [02:47<01:45,  1.60it/s]Your max_length is set to 50, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
 61%|██████▏   | 257/418 [02:51<01:37,  1.65it/s]Your max_length is set to 50, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
 64%|██████▍   | 267/418 [02:57<01:30,  1.68it/s]Your max_length is set to 50, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
 65%|██████▍   | 270/418 [02:59<01:26,  1.71it/s]Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 65%|██████▌   | 272/418 [03:00<01:31,  1.60it/s]Your max_length is set to 50, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 66%|██████▌   | 274/418 [03:02<01:30,  1.60it/s]Your max_length is set to 50, but your input_length is only 25. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)
 66%|██████▋   | 277/418 [03:03<01:18,  1.79it/s]Your max_length is set to 50, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)
 68%|██████▊   | 286/418 [03:09<01:30,  1.45it/s]Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 69%|██████▊   | 287/418 [03:10<01:24,  1.56it/s]Your max_length is set to 50, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 70%|███████   | 294/418 [03:14<01:18,  1.58it/s]Your max_length is set to 50, but your input_length is only 43. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
 71%|███████   | 297/418 [03:16<01:16,  1.59it/s]Your max_length is set to 50, but your input_length is only 29. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)
 71%|███████▏  | 298/418 [03:16<01:05,  1.83it/s]Your max_length is set to 50, but your input_length is only 45. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
 72%|███████▏  | 300/418 [03:18<01:15,  1.57it/s]Your max_length is set to 50, but your input_length is only 45. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
 72%|███████▏  | 303/418 [03:20<01:23,  1.37it/s]Your max_length is set to 50, but your input_length is only 26. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=13)
 74%|███████▍  | 309/418 [03:24<01:13,  1.49it/s]Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 75%|███████▍  | 312/418 [03:25<01:04,  1.65it/s]Your max_length is set to 50, but your input_length is only 42. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
 76%|███████▌  | 317/418 [03:29<00:59,  1.68it/s]Your max_length is set to 50, but your input_length is only 35. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 76%|███████▌  | 318/418 [03:29<00:53,  1.88it/s]Your max_length is set to 50, but your input_length is only 43. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=21)
 77%|███████▋  | 323/418 [03:32<01:06,  1.42it/s]Your max_length is set to 50, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 80%|████████  | 335/418 [03:40<00:48,  1.71it/s]Your max_length is set to 50, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
 85%|████████▍ | 355/418 [03:54<00:42,  1.48it/s]Your max_length is set to 50, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 85%|████████▌ | 357/418 [03:55<00:40,  1.50it/s]Your max_length is set to 50, but your input_length is only 48. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
 86%|████████▌ | 359/418 [03:56<00:31,  1.85it/s]Your max_length is set to 50, but your input_length is only 28. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)
 92%|█████████▏| 385/418 [04:15<00:25,  1.31it/s]Your max_length is set to 50, but your input_length is only 45. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
 93%|█████████▎| 387/418 [04:17<00:19,  1.56it/s]Your max_length is set to 50, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
 94%|█████████▍| 393/418 [04:20<00:15,  1.66it/s]Your max_length is set to 50, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)
 94%|█████████▍| 394/418 [04:20<00:13,  1.84it/s]Your max_length is set to 50, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
 95%|█████████▍| 397/418 [04:22<00:12,  1.64it/s]Your max_length is set to 50, but your input_length is only 30. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
 95%|█████████▌| 398/418 [04:23<00:10,  1.85it/s]Your max_length is set to 50, but your input_length is only 40. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
 98%|█████████▊| 408/418 [04:30<00:07,  1.35it/s]Your max_length is set to 50, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 98%|█████████▊| 409/418 [04:31<00:06,  1.39it/s]Your max_length is set to 50, but your input_length is only 37. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=18)
 98%|█████████▊| 411/418 [04:32<00:04,  1.52it/s]Your max_length is set to 50, but your input_length is only 45. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
100%|█████████▉| 417/418 [04:36<00:00,  1.57it/s]Your max_length is set to 50, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
100%|██████████| 418/418 [04:37<00:00,  1.59it/s]Your max_length is set to 50, but your input_length is only 49. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=24)
100%|██████████| 418/418 [04:37<00:00,  1.50it/s]
In [66]:
with open('/content/drive/MyDrive/Capstone/encoded_df_ish.pkl', 'rb') as file:
      encoded_df_ish = pickle.load(file)
In [67]:
encoded_df_ish.loc[5, ['Description', 'summary']]
Out[67]:
5
Description During the unloading operation of the ustulado...
summary unloading operation be need unclog discharge m...

In [68]:
def ngram_func_per_level(df, text_col, label_col, n=3, top_k=30):

    from collections import Counter
    import pandas as pd
    import nltk

    records = []

    for level in df[label_col].unique():
        texts = df[df[label_col] == level][text_col]
        all_words = " ".join(texts).split()
        ngram_counts = nltk.FreqDist(nltk.ngrams(all_words, n)).most_common(top_k)

        for ngram, count in ngram_counts:
            records.append({
                'accident_level': level,
                'ngram': " ".join(ngram),
                'count': count
            })

    return pd.DataFrame(records)

raw_ish_df = pd.DataFrame({
    'Description_processed': encoded_df_ish['Description_processed'],
    'Accident Level': encoded_df_ish['Accident Level']

})

# Create the trigram DataFrame by accident level
trigram_level_df = ngram_func_per_level(encoded_df_ish, 'summary', 'Accident Level', n=3)

# Plot
plt.figure(figsize=(18, 12))
sns.barplot(
    data=trigram_level_df,
    x='ngram',
    y='count',
    hue='accident_level',
    palette='Set1'
)
plt.title('Top Trigrams by Accident Level')
plt.xlabel('Frequency')
plt.ylabel('Trigram')
plt.xticks(rotation=90)
plt.legend(title='Accident Level')
plt.tight_layout()
plt.show()
No description has been provided for this image

Label Encoding on target variable¶

In [69]:
label_encoder = LabelEncoder()


y = label_encoder.fit_transform(encoded_df_ish['Accident Level'])
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print('label' )
print(label_mapping)

X=encoded_df_ish['summary'].astype(str).tolist()

X_raw, y_raw = X, y
label
{'I': 0, 'II': 1, 'III': 2, 'IV': 3, 'V': 4}

Train-Test Split¶

In [70]:
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

y_train=pd.Series(y_train)


X_train_raw=X_train.copy()
X_test_raw=X_test.copy()
y_train_raw=y_train.copy()
y_test_raw=y_test.copy()

Train Data Augmentation¶

In [71]:
y_train.value_counts().plot(kind='bar', title="Before Augmentation")
y_train.value_counts(normalize=True)
Out[71]:
proportion
0 0.739521
1 0.095808
2 0.074850
3 0.071856
4 0.017964

No description has been provided for this image

Back Translation¶

In [72]:
def augment_minority_classes_backtranslate(df, num_aug=2):
    from deep_translator import GoogleTranslator
    from collections import Counter

    def back_translate(text, src='en', mid='fr'):
        return GoogleTranslator(source=src, target=mid).translate(
            GoogleTranslator(source=mid, target=src).translate(text)
        )

    augmented_texts = []
    augmented_labels = []

    label_counts = df['label'].value_counts()
    max_count = label_counts.max()

    for label in label_counts.index:
        class_df = df[df['label'] == label]
        if label_counts[label] < max_count:
            gap = max_count - label_counts[label]
            samples_needed = gap // num_aug + 1

            for i, row in class_df.sample(samples_needed, replace=True).iterrows():
                augmented = [back_translate(row['text']) for _ in range(num_aug)]
                augmented_texts.extend(augmented)
                augmented_labels.extend([label] * len(augmented))

    aug_df = pd.DataFrame({'text': augmented_texts, 'label': augmented_labels})
    return pd.concat([df, aug_df], ignore_index=True)



df_augmented_bt = augment_minority_classes_backtranslate(pd.DataFrame({'text':X_train, 'label':y_train}))
df_augmented_bt['label'].value_counts().plot(kind='bar')


with open('/content/drive/MyDrive/Capstone/df_augmented_bt.pkl', 'wb') as file:
    pickle.dump(df_augmented_bt, file)
No description has been provided for this image
In [73]:
with open('/content/drive/MyDrive/Capstone/df_augmented_bt.pkl', 'rb') as file:
      df_augmented_bt = pickle.load(file)

NLP Augmentation¶

In [74]:
# Assuming X_train_text and y_train are raw text and labels
!pip install nlpaug
import nltk
nltk.download('averaged_perceptron_tagger_eng')
from collections import Counter
import pandas as pd
import nlpaug.augmenter.word as naw


df_train = pd.DataFrame({'text': X_train, 'label': y_train})
aug = naw.SynonymAug(aug_src='wordnet')
augmented_data = []

# You can define a target sample count per class
target_count = 60
counts = Counter(y_train)

for label, count in counts.items():
    if count < target_count:
        df_class = df_train[df_train['label'] == label]
        needed = target_count - count
        generated = []
        for text in df_class['text']:
            for _ in range(needed // len(df_class) + 1):
                generated.append(aug.augment(text))
                if len(generated) >= needed:
                    break
        df_aug = pd.DataFrame({'text': generated[:needed], 'label': label})
        augmented_data.append(df_aug)

# Combine augmented data with original data
df_aug_all = pd.concat(augmented_data)
df_augmentated_nlp = pd.concat([df_train, df_aug_all]).reset_index(drop=True)

with open('/content/drive/MyDrive/Capstone/df_augmentated_nlp.pkl', 'wb') as file:
    pickle.dump(df_augmentated_nlp, file)
Requirement already satisfied: nlpaug in /usr/local/lib/python3.11/dist-packages (1.1.11)
Requirement already satisfied: numpy>=1.16.2 in /usr/local/lib/python3.11/dist-packages (from nlpaug) (1.26.4)
Requirement already satisfied: pandas>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from nlpaug) (2.2.2)
Requirement already satisfied: requests>=2.22.0 in /usr/local/lib/python3.11/dist-packages (from nlpaug) (2.32.3)
Requirement already satisfied: gdown>=4.0.0 in /usr/local/lib/python3.11/dist-packages (from nlpaug) (5.2.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.11/dist-packages (from gdown>=4.0.0->nlpaug) (4.13.4)
Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from gdown>=4.0.0->nlpaug) (3.18.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from gdown>=4.0.0->nlpaug) (4.67.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2.0->nlpaug) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2.0->nlpaug) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas>=1.2.0->nlpaug) (2025.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests>=2.22.0->nlpaug) (3.4.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests>=2.22.0->nlpaug) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests>=2.22.0->nlpaug) (2.4.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests>=2.22.0->nlpaug) (2025.6.15)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas>=1.2.0->nlpaug) (1.17.0)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown>=4.0.0->nlpaug) (2.7)
Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.11/dist-packages (from beautifulsoup4->gdown>=4.0.0->nlpaug) (4.14.0)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.11/dist-packages (from requests[socks]->gdown>=4.0.0->nlpaug) (1.7.1)
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
In [75]:
with open('/content/drive/MyDrive/Capstone/df_augmentated_nlp.pkl', 'rb') as file:
      df_augmentated_nlp = pickle.load(file)

Tokenization¶

Keras Tokenizer¶

In [76]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Constants
max_vocab_size = 4000
max_length = 58

# Step 1: Define the tokenize function
def tokenize_text(texts, tokenizer=None, max_length=max_length, vocab_size=max_vocab_size):
    # If tokenizer not provided, fit a new one
    if tokenizer is None:
        tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
        tokenizer.fit_on_texts(texts)

    # Convert to sequences
    sequences = tokenizer.texts_to_sequences(texts)

    # Pad sequences
    texts_tokenized = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')

    return texts_tokenized, tokenizer


# Train tokenizer on main training set
X_train_tokenized, train_tokenizer = tokenize_text(X_train)

X_train_bt_tokenized, train_tokenizer = tokenize_text(df_augmented_bt['text'], tokenizer=train_tokenizer)
y_train_bt_tokenized = df_augmented_bt['label'].values

X_train_nlp_tokenized, train_tokenizer = tokenize_text(df_augmentated_nlp['text'], tokenizer=train_tokenizer)
y_train_nlp_tokenized = df_augmentated_nlp['label'].values

# Use the same tokenizer on test and augmented sets
X_test_tokenized, _ = tokenize_text(X_test, tokenizer=train_tokenizer)

Bert Tokenizer¶

In [77]:
from transformers import BertTokenizer, TFBertModel

# Load tokenizer and model
bert_model = TFBertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model.trainable = False

# Tokenize texts
def bert_tokenize(texts, max_len=58):
    tokens = bert_tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_len,
        return_tensors='tf'
    )
    return tokens

X_train_bert_tokenized= bert_tokenize(X_train)

X_train_bt_bert_tokenized= bert_tokenize(df_augmented_bt['text'].astype(str).tolist())
y_train_bt_bert_tokenized= df_augmented_bt['label'].values

X_train_nlp_bert_tokenized= bert_tokenize(df_augmentated_nlp['text'].astype(str).tolist())
y_train_nlp_bert_tokenized= df_augmentated_nlp['label'].values


X_test_bert_tokenized= bert_tokenize(X_test)
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.

Embedding Matrixs¶

Glove embedding Matrix¶

In [78]:
import numpy as np
import pickle

def glove_embedding_matrix(tokenizer):
    try:
        with open('/content/drive/MyDrive/Capstone/glove_model.pkl', 'rb') as file:
            model_glove = pickle.load(file)
    except FileNotFoundError:
        # Fallback to loading GloVe from text file
        glove_file = '/content/drive/MyDrive/Capstone/glove.6B.100d.txt'
        model_glove = {}
        with open(glove_file, 'r', encoding='utf-8') as f:
            for line in f:
                values = line.strip().split()
                word = values[0]
                vector = np.array(values[1:], dtype='float32')
                model_glove[word] = vector

    vec_size = 100
    max_vocab_size = 10000

    if hasattr(tokenizer, 'vocab'):
        # BertTokenizer
        vocab_size = len(tokenizer.vocab)
        embedding_matrix = np.zeros((vocab_size, vec_size))
        oov_count = 0
        for token, i in tokenizer.vocab.items():
            clean_token = token.replace('##', '').lower()
            if clean_token in model_glove:
                embedding_matrix[i] = model_glove[clean_token]
            elif token.lower() in model_glove:
                embedding_matrix[i] = model_glove[token.lower()]
            else:
                embedding_matrix[i] = np.random.normal(0, 0.1, vec_size)
                oov_count += 1
        print(f"OOV tokens: {oov_count}/{vocab_size} ({oov_count/vocab_size:.2%})")
    else:
        # Keras Tokenizer
        vocab_size = min(max_vocab_size, len(tokenizer.word_index) + 1)
        embedding_matrix = np.zeros((vocab_size, vec_size))
        oov_count = 0
        for word, i in tokenizer.word_index.items():
            if i >= vocab_size:
                continue
            if word in model_glove:
                embedding_matrix[i] = model_glove[word]
            else:
                embedding_matrix[i] = np.random.normal(0, 0.1, vec_size)
                oov_count += 1
        print(f"OOV words: {oov_count}/{vocab_size-1} ({oov_count/(vocab_size-1):.2%})")

    print(f'Shape of glove embedding matrix : {embedding_matrix.shape}')
    return embedding_matrix

BERT Embedding¶

In [79]:
def get_bert_embedding(tokens):
  outputs = bert_model(tokens['input_ids'], attention_mask=tokens['attention_mask'])
  cls_embeddings = outputs.last_hidden_state[:, 0, :]
  return cls_embeddings.numpy()

def get_full_bert_hidden_states(tokens):
    outputs = bert_model(tokens['input_ids'], attention_mask=tokens['attention_mask'])
    return outputs.last_hidden_state

X_train_bert_embedding = get_bert_embedding(X_train_bert_tokenized)

X_train_bt_bert_embedding = get_bert_embedding(X_train_bt_bert_tokenized)
y_train_bt_bert_embedding = (y_train_bt_bert_tokenized)

X_train_nlp_bert_embedding = get_bert_embedding(X_train_nlp_bert_tokenized)
y_train_nlp_bert_embedding = y_train_nlp_bert_tokenized

X_test_bert_embedding= get_bert_embedding(X_test_bert_tokenized)


X_train_bert_hstate = get_full_bert_hidden_states(X_train_bert_tokenized)

X_train_bt_bert_hstate = get_full_bert_hidden_states(X_train_bt_bert_tokenized)
y_train_bt_bert_hstate = (y_train_bt_bert_tokenized)

X_train_nlp_bert_hstate = get_full_bert_hidden_states(X_train_nlp_bert_tokenized)
y_train_nlp_bert_hstate = y_train_nlp_bert_tokenized

X_test_bert_hstate= get_full_bert_hidden_states(X_test_bert_tokenized)

Utility functions for evaluating NN model performance¶

In [80]:
from sklearn.metrics import accuracy_score, classification_report, precision_recall_curve, auc
import matplotlib.pyplot as plt

def NN_performance_check(model, y_train, y_pred_train, y_test, y_pred, history, model_name="Model", embedding_name='embedding'):
    print(f"\n🔍 {model_name} Accuracy on Test Data: {accuracy_score(y_test, y_pred):.4f}")
    print("\nClassification Report:\n")
    print(classification_report(y_test, y_pred))

    # Confusion matrix (if needed)
    result_df_final = confusion_metrics(model, model_name, embedding_name, y_train, y_pred_train, y_test, y_pred)

    # Plot accuracy, loss, and precision-recall curves
    fig, axs = plt.subplots(1, 2, figsize=(15, 3))

    # Accuracy
    axs[0].plot(history.history['accuracy'], label='Train Accuracy')
    axs[0].plot(history.history['val_accuracy'], label='Validation Accuracy')
    axs[0].set_title(f'{model_name} Accuracy\n({embedding_name})')
    axs[0].set_xlabel('Epochs')
    axs[0].set_ylabel('Accuracy')
    axs[0].legend()
    axs[0].grid(True)

    # Loss
    axs[1].plot(history.history['loss'], label='Train Loss')
    axs[1].plot(history.history['val_loss'], label='Validation Loss')
    axs[1].set_title(f'{model_name} Loss\n({embedding_name})')
    axs[1].set_xlabel('Epochs')
    axs[1].set_ylabel('Loss')
    axs[1].legend()
    axs[1].grid(True)
    plt.tight_layout()
    plt.show()


    return result_df_final
In [81]:
import tensorflow.keras.backend as K
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight


def sparse_categorical_focal_loss(gamma=2., alpha=0.75):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.int32)
        y_true = tf.reshape(y_true, [-1])  # Make sure it's flat
        y_true_one_hot = tf.one_hot(y_true, depth=tf.shape(y_pred)[-1])

        y_pred = tf.clip_by_value(y_pred, K.epsilon(), 1. - K.epsilon())
        cross_entropy = -y_true_one_hot * tf.math.log(y_pred)
        weight = alpha * tf.pow(1 - y_pred, gamma)
        focal_loss = weight * cross_entropy
        return tf.reduce_sum(focal_loss, axis=1)
    return loss
In [82]:
def macro_f1(y_true, y_pred):
    num_classes = tf.shape(y_pred)[-1]

    # Convert predictions to one-hot
    y_pred_labels = tf.argmax(y_pred, axis=-1)
    y_pred_onehot = tf.one_hot(y_pred_labels, depth=num_classes)

    # Convert true labels to one-hot
    y_true_onehot = tf.one_hot(tf.cast(y_true, tf.int32), depth=num_classes)

    tp = tf.reduce_sum(y_true_onehot * y_pred_onehot, axis=0)
    fp = tf.reduce_sum((1 - y_true_onehot) * y_pred_onehot, axis=0)
    fn = tf.reduce_sum(y_true_onehot * (1 - y_pred_onehot), axis=0)

    precision = tp / (tp + fp + K.epsilon())
    recall = tp / (tp + fn + K.epsilon())

    f1 = 2 * precision * recall / (precision + recall + K.epsilon())
    return tf.reduce_mean(f1)
In [83]:
result_df_final=result_df_final.iloc[0:0]
In [84]:
# import pickle

# # Load the pickle file
# with open("/content/drive/MyDrive/Capstone/bert_tokenized_data.pkl", "rb") as f:
#     bert_data = pickle.load(f)

# # Assign to variables
# X_train_bert_embedding      = bert_data["X_train_bert_embedding"]
# X_train_bt_bert_embedding   = bert_data["X_train_bt_bert_embedding"]
# y_train_bt_bert_embedding   = bert_data["y_train_bt_bert_embedding"]
# X_train_nlp_bert_embedding  = bert_data["X_train_nlp_bert_embedding"]
# y_train_nlp_bert_embedding  = bert_data["y_train_nlp_bert_embedding"]
# X_test_bert_embedding       = bert_data["X_test_bert_embedding"]

# X_train_bert_hstate         = bert_data["X_train_bert_hstate"]
# X_train_bt_bert_hstate      = bert_data["X_train_bt_bert_hstate"]
# y_train_bt_bert_hstate      = bert_data["y_train_bt_bert_hstate"]
# X_train_nlp_bert_hstate     = bert_data["X_train_nlp_bert_hstate"]
# y_train_nlp_bert_hstate     = bert_data["y_train_nlp_bert_hstate"]
# X_test_bert_hstate          = bert_data["X_test_bert_hstate"]

Artificial Neural Network (ANN)¶

In [85]:
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Dense, Dropout, BatchNormalization, GlobalAveragePooling1D
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import random
import matplotlib.pyplot as plt

# Custom focal loss function
def focal_loss(gamma=2.0, alpha=0.25):
    def focal_loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, tf.int32)
        y_true = tf.one_hot(y_true, depth=tf.shape(y_pred)[-1])
        epsilon = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, epsilon, 1. - epsilon)

        cross_entropy = -y_true * tf.math.log(y_pred)
        weight = alpha * y_true * tf.pow(1 - y_pred, gamma)
        loss = weight * cross_entropy
        return tf.reduce_mean(tf.reduce_sum(loss, axis=1))
    return focal_loss_fn


def ann_model(X_train, y_train, X_test, y_test, embedding_matrix=None, embedding_type='glove', run_index=None, n_iter=10):
    y_train = y_train.to_numpy() if hasattr(y_train, 'to_numpy') else y_train
    y_test = y_test.to_numpy() if hasattr(y_test, 'to_numpy') else y_test
    num_classes = len(np.unique(y_train))

    # Only tune non-architectural hyperparameters
    # param_dist = {
    #     'dense_dropout': [0.2, 0.3, 0.4,0.1,0.05],
    #     'learning_rate': [0.0001, 0.0005, 0.001],
    #     'activation': ['swish', 'relu', 'tanh'],
    #     'batch_size': [32, 64,16,50],
    #     'gamma': [1.0, 2.0],
    #     'alpha': [0.25, 0.5, 0.75]
    # }

      # Only tune non-architectural hyperparameters
    param_dist = {
        'dense_dropout': [0.1],
        'learning_rate': [0.0001],
        'activation': ['swish'],
        'batch_size': [32],
        'gamma': [2.0],
        'alpha': [ 0.75]
    }

    best_val_accuracy = 0
    best_model = None
    best_y_pred = None
    best_history = None
    best_params = None
    search_results = []

    for i in range(n_iter):
        params = {key: random.choice(values) for key, values in param_dist.items()}

        model = Sequential()

        if embedding_matrix is not None:
            model.add(Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=X_train.shape[1],
                                trainable=True))
            model.add(GlobalAveragePooling1D())
        else:
            model.add(Dense(64, activation=params['activation'], input_shape=(X_train.shape[1],)))

        model.add(BatchNormalization())
        model.add(Dense(32, activation=params['activation']))
        model.add(Dropout(params['dense_dropout']))
        model.add(Dense(16, activation=params['activation']))
        model.add(Dropout(params['dense_dropout']))
        model.add(Dense(num_classes, activation='softmax'))

        loss = focal_loss(gamma=params['gamma'], alpha=params['alpha'])
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=params['learning_rate'])

        model.compile(
            optimizer=optimizer,
            loss=loss,
            metrics=[
                tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy'),
                tfa.metrics.HammingLoss(mode='multiclass', name='hamming_loss')
            ]
        )

        class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
        class_weight_dict = dict(zip(np.unique(y_train), class_weights))

        callbacks = [
            EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True),
            ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6)
        ]

        history = model.fit(
            X_train, y_train,
            validation_data=(X_test, y_test),
            epochs=50,
            batch_size=params['batch_size'],
            class_weight=class_weight_dict,
            callbacks=callbacks,
            verbose=0
        )

        val_accuracy = max(history.history['val_accuracy'])
        y_pred = np.argmax(model.predict(X_test, verbose=0), axis=1)
        y_pred_train = np.argmax(model.predict(X_train, verbose=0), axis=1)

        f1 = f1_score(y_test, y_pred, average='macro')
        acc = accuracy_score(y_test, y_pred)
        prec = precision_score(y_test, y_pred, average='macro', zero_division=0)
        rec = recall_score(y_test, y_pred, average='macro', zero_division=0)

        search_results.append((f1, acc, prec, rec, params['learning_rate'], params['dense_dropout'],
                               params['activation'], params['batch_size'], params['gamma'], params['alpha']))

        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_model = model
            best_y_pred = y_pred
            best_history = history
            best_params = params

    if run_index is None:
        print("\n🎯 Random Search Summary (Fixed Layer Architecture):")
        best_result = sorted(search_results, key=lambda x: x[0], reverse=True)[0]
        print(f"Best model -> F1 Macro: {best_result[0]:.4f}, Acc: {best_result[1]:.4f}, "
              f"Prec: {best_result[2]:.4f}, Rec: {best_result[3]:.4f}, "
              f"lr={best_result[4]}, dDO={best_result[5]}, act={best_result[6]}, "
              f"batch={best_result[7]}, gamma={best_result[8]}, alpha={best_result[9]}")
        print(f"Best parameters: {best_params}")
        print(f"Best validation accuracy: {best_val_accuracy:.4f}")
    NN_performance_check(best_model, y_train, y_pred_train, y_test, best_y_pred,
                                 best_history, model_name='ANN', embedding_name=embedding_type)
    return best_model, best_y_pred, best_history, best_params

ANN with Glove embedding¶

In [86]:
print("======================== 🔹 Training ANN with GloVe Embedding - Original Data 🔹 ========================")
best_model_ann_glove_original_data , best_pred_ann_glove_original_data, best_history_ann_glove_original_data, best_params_ann_glove_original_data = ann_model(X_train_tokenized, y_train, X_test_tokenized, y_test, glove_embedding_matrix(train_tokenizer), 'GloVe Embedding - Original Data')



print("======================== 🔹 Training ANN with GloVe Embedding - Backtranslated Data 🔹 ========================")
best_model_ann_glove_BT_data , best_pred_ann_glove_BT_data, best_history_ann_glove_BT_data, best_params_ann_glove_BT_data= ann_model(X_train_bt_tokenized, y_train_bt_tokenized, X_test_tokenized, y_test, glove_embedding_matrix(train_tokenizer), 'GloVe Embedding - Backtranslated Data')



print("======================== 🔹 Training ANN with GloVe Embedding - NLP Augmented Data 🔹 ========================")
best_model_ann_glove_nlp_data , best_pred_ann_glove_nlp_data, best_history_ann_glove_nlp_data, best_params_ann_glove_nlp_data=ann_model(X_train_nlp_tokenized, y_train_nlp_tokenized, X_test_tokenized, y_test, glove_embedding_matrix(train_tokenizer), 'GloVe Embedding - NLP Augmented Data')
======================== 🔹 Training ANN with GloVe Embedding - Original Data 🔹 ========================
OOV words: 63/1286 (4.90%)
Shape of glove embedding matrix : (1287, 100)

🎯 Random Search Summary (Fixed Layer Architecture):
Best model -> F1 Macro: 0.2282, Acc: 0.7500, Prec: 0.3494, Rec: 0.2333, lr=0.0001, dDO=0.1, act=swish, batch=32, gamma=2.0, alpha=0.75
Best parameters: {'dense_dropout': 0.1, 'learning_rate': 0.0001, 'activation': 'swish', 'batch_size': 32, 'gamma': 2.0, 'alpha': 0.75}
Best validation accuracy: 0.7500

🔍 ANN Accuracy on Test Data: 0.7500

Classification Report:

              precision    recall  f1-score   support

           0       0.75      1.00      0.86        62
           1       0.00      0.00      0.00         8
           2       1.00      0.17      0.29         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.35      0.23      0.23        84
weighted avg       0.62      0.75      0.65        84

Accuracy : 0.75
Precision (macro) : 0.3493975903614458
Precision (avg): 0.6227768215720023
Recall : 0.75
F1-Score : 0.6516068496364064
No description has been provided for this image
No description has been provided for this image
======================== 🔹 Training ANN with GloVe Embedding - Backtranslated Data 🔹 ========================
OOV words: 63/1286 (4.90%)
Shape of glove embedding matrix : (1287, 100)

🎯 Random Search Summary (Fixed Layer Architecture):
Best model -> F1 Macro: 0.2744, Acc: 0.5238, Prec: 0.2740, Rec: 0.3040, lr=0.0001, dDO=0.1, act=swish, batch=32, gamma=2.0, alpha=0.75
Best parameters: {'dense_dropout': 0.1, 'learning_rate': 0.0001, 'activation': 'swish', 'batch_size': 32, 'gamma': 2.0, 'alpha': 0.75}
Best validation accuracy: 0.5238

🔍 ANN Accuracy on Test Data: 0.5238

Classification Report:

              precision    recall  f1-score   support

           0       0.75      0.65      0.70        62
           1       0.12      0.38      0.18         8
           2       0.00      0.00      0.00         6
           3       0.00      0.00      0.00         6
           4       0.50      0.50      0.50         2

    accuracy                           0.52        84
   macro avg       0.27      0.30      0.27        84
weighted avg       0.58      0.52      0.54        84

Accuracy : 0.5238095238095238
Precision (macro) : 0.2740203193033382
Precision (avg): 0.5799467827769714
Recall : 0.5238095238095238
F1-Score : 0.5421690415296554
No description has been provided for this image
No description has been provided for this image
======================== 🔹 Training ANN with GloVe Embedding - NLP Augmented Data 🔹 ========================
OOV words: 63/1286 (4.90%)
Shape of glove embedding matrix : (1287, 100)

🎯 Random Search Summary (Fixed Layer Architecture):
Best model -> F1 Macro: 0.2294, Acc: 0.7500, Prec: 0.3512, Rec: 0.2333, lr=0.0001, dDO=0.1, act=swish, batch=32, gamma=2.0, alpha=0.75
Best parameters: {'dense_dropout': 0.1, 'learning_rate': 0.0001, 'activation': 'swish', 'batch_size': 32, 'gamma': 2.0, 'alpha': 0.75}
Best validation accuracy: 0.7500

🔍 ANN Accuracy on Test Data: 0.7500

Classification Report:

              precision    recall  f1-score   support

           0       0.76      1.00      0.86        62
           1       0.00      0.00      0.00         8
           2       1.00      0.17      0.29         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.35      0.23      0.23        84
weighted avg       0.63      0.75      0.66        84

Accuracy : 0.75
Precision (macro) : 0.35121951219512193
Precision (avg): 0.6295005807200929
Recall : 0.75
F1-Score : 0.6559901738473167
No description has been provided for this image
No description has been provided for this image

ANN with BERT Tokenized Data¶

In [87]:
print(' ======================== 🔹 Training ANN with BERT Tokenized Data - Original 🔹 ========================')
best_model_ann_bert_original, best_pred_ann_bert_original, best_history_ann_bert_original, best_params_ann_bert_original = ann_model(
    X_train_bert_embedding, y_train, X_test_bert_embedding, y_test, None, 'BERT Tokenized Data - Original'
)

print(' ======================== 🔹 Training ANN with BERT Tokenized Data - Backtranslated 🔹 ========================')
best_model_ann_bert_bt, best_pred_ann_bert_bt, best_history_ann_bert_bt, best_params_ann_bert_bt = ann_model(
    X_train_bt_bert_embedding, y_train_bt_bert_embedding, X_test_bert_embedding, y_test, None, 'BERT Tokenized Data - Backtranslated'
)

print(' ======================== 🔹 Training ANN with BERT Tokenized Data - NLP Augmented 🔹 ========================')
best_model_ann_bert_nlp_aug, best_pred_ann_bert_nlp_aug, best_history_ann_bert_nlp_aug, best_params_ann_bert_nlp_aug = ann_model(
    X_train_nlp_bert_embedding, y_train_nlp_bert_embedding, X_test_bert_embedding, y_test, None, 'BERT Tokenized Data - NLP Augmented'
)
 ======================== 🔹 Training ANN with BERT Tokenized Data - Original 🔹 ========================

🎯 Random Search Summary (Fixed Layer Architecture):
Best model -> F1 Macro: 0.3045, Acc: 0.7738, Prec: 0.4883, Rec: 0.2833, lr=0.0001, dDO=0.1, act=swish, batch=32, gamma=2.0, alpha=0.75
Best parameters: {'dense_dropout': 0.1, 'learning_rate': 0.0001, 'activation': 'swish', 'batch_size': 32, 'gamma': 2.0, 'alpha': 0.75}
Best validation accuracy: 0.7738

🔍 ANN Accuracy on Test Data: 0.7738

Classification Report:

              precision    recall  f1-score   support

           0       0.78      1.00      0.87        62
           1       0.67      0.25      0.36         8
           2       1.00      0.17      0.29         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.77        84
   macro avg       0.49      0.28      0.30        84
weighted avg       0.71      0.77      0.70        84

Accuracy : 0.7738095238095238
Precision (macro) : 0.4883333333333333
Precision (avg): 0.7069444444444445
Recall : 0.7738095238095238
F1-Score : 0.6995740677833232
No description has been provided for this image
No description has been provided for this image
 ======================== 🔹 Training ANN with BERT Tokenized Data - Backtranslated 🔹 ========================

🎯 Random Search Summary (Fixed Layer Architecture):
Best model -> F1 Macro: 0.3152, Acc: 0.4405, Prec: 0.3044, Rec: 0.4685, lr=0.0001, dDO=0.1, act=swish, batch=32, gamma=2.0, alpha=0.75
Best parameters: {'dense_dropout': 0.1, 'learning_rate': 0.0001, 'activation': 'swish', 'batch_size': 32, 'gamma': 2.0, 'alpha': 0.75}
Best validation accuracy: 0.6905

🔍 ANN Accuracy on Test Data: 0.6905

Classification Report:

              precision    recall  f1-score   support

           0       0.74      0.94      0.83        62
           1       0.00      0.00      0.00         8
           2       0.00      0.00      0.00         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.69        84
   macro avg       0.15      0.19      0.17        84
weighted avg       0.55      0.69      0.61        84

Accuracy : 0.6904761904761905
Precision (macro) : 0.14871794871794872
Precision (avg): 0.5488400488400489
Recall : 0.6904761904761905
F1-Score : 0.6115646258503402
No description has been provided for this image
No description has been provided for this image
 ======================== 🔹 Training ANN with BERT Tokenized Data - NLP Augmented 🔹 ========================

🎯 Random Search Summary (Fixed Layer Architecture):
Best model -> F1 Macro: 0.2123, Acc: 0.7262, Prec: 0.2148, Rec: 0.2269, lr=0.0001, dDO=0.1, act=swish, batch=32, gamma=2.0, alpha=0.75
Best parameters: {'dense_dropout': 0.1, 'learning_rate': 0.0001, 'activation': 'swish', 'batch_size': 32, 'gamma': 2.0, 'alpha': 0.75}
Best validation accuracy: 0.7500

🔍 ANN Accuracy on Test Data: 0.7500

Classification Report:

              precision    recall  f1-score   support

           0       0.77      1.00      0.87        62
           1       0.33      0.12      0.18         8
           2       0.00      0.00      0.00         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.22      0.23      0.21        84
weighted avg       0.60      0.75      0.66        84

Accuracy : 0.75
Precision (macro) : 0.21975308641975308
Precision (avg): 0.5967078189300411
Recall : 0.75
F1-Score : 0.6573426573426573
No description has been provided for this image
No description has been provided for this image

Performance evaluation & Conclusion (ANN)¶

In [88]:
result_df_final[-6:]
Out[88]:
Classifier Train Accuracy Test Accuracy Train F1 Score(avg) Test F1 Score(avg) Train F1 Score(macro) Test F1 Score(macro) Train Recall Test Recall Train Precision (avg) Test Precision (avg) Test Time cv score
0 ANN with GloVe Embedding - Original Data 0.730539 0.750000 0.646910 0.651607 0.232316 0.228177 0.730539 0.750000 0.607719 0.622777 - -
1 ANN with GloVe Embedding - Backtranslated Data 0.126613 0.523810 0.122897 0.542169 0.122862 0.274425 0.126613 0.523810 0.128619 0.579947 - -
2 ANN with GloVe Embedding - NLP Augmented Data 0.466119 0.750000 0.326085 0.655990 0.132519 0.229365 0.466119 0.750000 0.254816 0.629501 - -
3 ANN with BERT Tokenized Data - Original 0.739521 0.773810 0.629868 0.699574 0.170345 0.304518 0.739521 0.773810 0.548534 0.706944 - -
4 ANN with BERT Tokenized Data - Backtranslated 0.242742 0.690476 0.149064 0.611565 0.149130 0.165714 0.242742 0.690476 0.141820 0.548840 - -
5 ANN with BERT Tokenized Data - NLP Augmented 0.509240 0.750000 0.362239 0.657343 0.159603 0.209790 0.509240 0.750000 0.331455 0.596708 - -

Recurrent Neural Network (RNN)¶

In [4]:
def bi_rnn_model(X_train, y_train, X_test, y_test, embedding_matrix=None, embedding_name=None, run_index=None, n_iter=20):
    import tensorflow as tf
    import tensorflow_addons as tfa
    import random, os, json
    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight
    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, Dense, Dropout, SimpleRNN, BatchNormalization, Embedding, SpatialDropout1D
    from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
    from tensorflow.keras.initializers import get as get_initializer

    y_train = y_train.to_numpy() if hasattr(y_train, 'to_numpy') else y_train
    y_test = y_test.to_numpy() if hasattr(y_test, 'to_numpy') else y_test
    num_classes = len(np.unique(y_train))
    best_val_accuracy = 0
    best_model = None
    best_y_pred = None
    best_history = None
    best_params = None
    search_results = []
    param_dist = {
        'dense_dropout': [0.2, 0.3, 0.1],
        'learning_rate': [0.001, 0.0005,0.003],
        'activation': ['relu', 'tanh', 'swish'],
        'initializer': ['orthogonal', 'glorot_uniform', 'he_normal']
    }

    COMBO_LOG_PATH = "tried_combinations_fixed_rnn.json"
    if os.path.exists(COMBO_LOG_PATH):
        with open(COMBO_LOG_PATH, "r") as f:
            tried_combinations = json.load(f)
    else:
        tried_combinations = []

    best_val_accuracy = 0
    best_model = None
    best_y_pred = None
    best_history = None
    best_params = None

    for _ in range(n_iter):
        attempts = 0
        while attempts < 1000:
            params = {k: random.choice(v) for k, v in param_dist.items()}
            combo_key = json.dumps(params, sort_keys=True)
            if combo_key not in tried_combinations:
                tried_combinations.append(combo_key)
                break
            attempts += 1
        else:
            print("⚠️ Too many duplicate attempts. Stopping.")
            break

        model = Sequential()

        # Handle embedding (2D input)
        if embedding_matrix is not None:
            model.add(Input(shape=(X_train.shape[1],)))
            model.add(Embedding(
                input_dim=embedding_matrix.shape[0],
                output_dim=embedding_matrix.shape[1],
                weights=[embedding_matrix],
                input_length=X_train.shape[1],
                trainable=True
            ))
            model.add(SpatialDropout1D(0.2))
        else:
            # Convert 2D input to 3D (reshape)
            if len(X_train.shape) == 2:
                features = X_train.shape[1]
                for t in [7, 5, 2, 1]:
                    if features % t == 0:
                        time_steps = t
                        break
                feature_per_step = features // time_steps
                X_train = X_train.reshape((X_train.shape[0], time_steps, feature_per_step))
                X_test = X_test.reshape((X_test.shape[0], time_steps, feature_per_step))

            model.add(Input(shape=(X_train.shape[1], X_train.shape[2])))
            model.add(SpatialDropout1D(0.2))

        model.add(SimpleRNN(64, return_sequences=True, kernel_initializer=get_initializer(params['initializer'])))
        model.add(BatchNormalization())
        model.add(SimpleRNN(32, kernel_initializer=get_initializer(params['initializer'])))
        model.add(Dense(32, activation=params['activation'], kernel_initializer=get_initializer(params['initializer'])))
        model.add(Dropout(params['dense_dropout']))
        model.add(Dense(num_classes, activation='softmax'))

        loss = sparse_categorical_focal_loss(gamma=1., alpha=0.5)
        optimizer = tf.keras.optimizers.Adam(learning_rate=params['learning_rate'])

        model.compile(
            optimizer=optimizer,
            loss=loss,
            metrics=[
                tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy'),
                tfa.metrics.HammingLoss(mode='multiclass', name='hamming_loss')
            ]
        )

        class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
        class_weight_dict = dict(zip(np.unique(y_train), class_weights))

        callbacks = [
            EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True),
            ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6)
        ]

        history = model.fit(
            X_train, y_train,
            validation_data=(X_test, y_test),
            epochs=15,
            batch_size=32,
            class_weight=class_weight_dict,
            callbacks=callbacks,
            verbose=0
        )

       # Get validation accuracy and other metrics
        val_accuracy = max(history.history['val_accuracy'])
        y_pred = np.argmax(model.predict(X_test, verbose=0), axis=1)
        f1 = f1_score(y_test, y_pred, average='macro')
        acc = accuracy_score(y_test, y_pred)
        prec = precision_score(y_test, y_pred, average='macro', zero_division=0)
        rec = recall_score(y_test, y_pred, average='macro', zero_division=0)

        # Store results
        search_results.append((f1, acc, prec, rec, params['learning_rate'], params['dense_dropout'],
                       params['activation']))


        # Update best model if current is better
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_model = model
            best_y_pred = y_pred
            best_history = history
            best_params = params

    # Print random search summary
    if run_index is None:
        print("\n🎯 Random Search Summary:")
        best_result = sorted(search_results, key=lambda x: x[0], reverse=True)[0]
        best_result = sorted(search_results, key=lambda x: x[0], reverse=True)[0]
        print(f"Best model -> F1 Macro: {best_result[0]:.4f}, Acc: {best_result[1]:.4f}, "
        f"Prec: {best_result[2]:.4f}, Rec: {best_result[3]:.4f}, "
        f"lr={best_result[4]}, dDO={best_result[5]}, act={best_result[6]}")

        print(f"Best parameters: {best_params}")
        print(f"Best validation accuracy: {best_val_accuracy:.4f}")

        # Try performance check if function is defined
        try:
            y_pred_train = np.argmax(best_model.predict(X_train, verbose=0), axis=1)
            NN_performance_check(best_model, y_train, y_pred_train, y_test, best_y_pred,
                                 best_history, model_name='RNN', embedding_name=embedding_type)
        except NameError:
            print("NN_performance_check not defined. Skipping performance check.")

    # Plotting results across iterations
    f1_macros = [result[0] for result in search_results]
    accuracies = [result[1] for result in search_results]
    precisions = [result[2] for result in search_results]
    recalls = [result[3] for result in search_results]

    plt.figure(figsize=(16, 8))
    plt.subplot(2, 2, 1)
    plt.plot(range(1, n_iter+1), f1_macros, marker='o', label='F1 Macro')
    plt.title('Macro F1 over Iterations')
    plt.xlabel('Iteration #')
    plt.ylabel('F1 Macro')
    plt.grid(True)

    plt.subplot(2, 2, 2)
    plt.plot(range(1, n_iter+1), accuracies, marker='o', label='Accuracy')
    plt.title('Accuracy over Iterations')
    plt.xlabel('Iteration #')
    plt.ylabel('Accuracy')
    plt.grid(True)

    plt.subplot(2, 2, 3)
    plt.plot(range(1, n_iter+1), precisions, marker='o', label='Precision', color='orange')
    plt.title('Precision over Iterations')
    plt.xlabel('Iteration #')
    plt.ylabel('Precision')
    plt.grid(True)

    plt.subplot(2, 2, 4)
    plt.plot(range(1, n_iter+1), recalls, marker='o', label='Recall', color='green')
    plt.title('Recall over Iterations')
    plt.xlabel('Iteration #')
    plt.ylabel('Recall')
    plt.grid(True)

    plt.tight_layout()
    plt.savefig('random_search_metrics.png')
    plt.show()

    return best_model, best_y_pred, best_history, best_params

RNN with GloVe Embedding¶

In [89]:
print(' ======================== 🔹 Training RNN with GloVe Embedding - Original Data 🔹 ========================')
best_model_rnn_glove_original, best_pred_rnn_glove_original, best_history_rnn_glove_original, best_params_rnn_glove_original = bi_rnn_model(
    X_train_tokenized, y_train, X_test_tokenized, y_test,
    glove_embedding_matrix(train_tokenizer),
    'GloVe Embedding - Original Data'
)

print(' ======================== 🔹 Training RNN with GloVe Embedding - Backtranslated Data 🔹 ========================')
best_model_rnn_glove_bt, best_pred_rnn_glove_bt, best_history_rnn_glove_bt, best_params_rnn_glove_bt = bi_rnn_model(
    X_train_bt_tokenized, y_train_bt_tokenized, X_test_tokenized, y_test,
    glove_embedding_matrix(train_tokenizer),
    'GloVe Embedding - Backtranslated Data'
)

print(' ======================== 🔹 Training RNN with GloVe Embedding - NLP Augmented Data 🔹 ========================')
best_model_rnn_glove_nlp, best_pred_rnn_glove_nlp, best_history_rnn_glove_nlp, best_params_rnn_glove_nlp = bi_rnn_model(
    X_train_nlp_tokenized, y_train_nlp_tokenized, X_test_tokenized, y_test,
    glove_embedding_matrix(train_tokenizer),
    'GloVe Embedding - NLP Augmented Data'
)
 ======================== 🔹 Training RNN with GloVe Embedding - Original Data 🔹 ========================
OOV words: 63/1286 (4.90%)
Shape of glove embedding matrix : (1287, 100)
WARNING:tensorflow:5 out of the last 29 calls to <function Model.make_predict_function.<locals>.predict_function at 0x78874df73d80> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
WARNING:tensorflow:5 out of the last 13 calls to <function Model.make_predict_function.<locals>.predict_function at 0x78866fad8a40> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
🎯 Random Search Summary:
Best model -> F1 Macro: 0.2325, Acc: 0.2619, Prec: 0.3035, Rec: 0.4559, lr=0.0005, dDO=0.1, act=relu
Best parameters: {'dense_dropout': 0.2, 'learning_rate': 0.001, 'activation': 'tanh', 'initializer': 'he_normal'}
Best validation accuracy: 0.7381
NN_performance_check not defined. Skipping performance check.
No description has been provided for this image
 ======================== 🔹 Training RNN with GloVe Embedding - Backtranslated Data 🔹 ========================
OOV words: 63/1286 (4.90%)
Shape of glove embedding matrix : (1287, 100)

🎯 Random Search Summary:
Best model -> F1 Macro: 0.3049, Acc: 0.7381, Prec: 0.4179, Rec: 0.2820, lr=0.0005, dDO=0.2, act=relu
Best parameters: {'dense_dropout': 0.3, 'learning_rate': 0.001, 'activation': 'swish', 'initializer': 'orthogonal'}
Best validation accuracy: 0.7381
NN_performance_check not defined. Skipping performance check.
No description has been provided for this image
 ======================== 🔹 Training RNN with GloVe Embedding - NLP Augmented Data 🔹 ========================
OOV words: 63/1286 (4.90%)
Shape of glove embedding matrix : (1287, 100)

🎯 Random Search Summary:
Best model -> F1 Macro: 0.2434, Acc: 0.7500, Prec: 0.2840, Rec: 0.2468, lr=0.001, dDO=0.2, act=tanh
Best parameters: {'dense_dropout': 0.2, 'learning_rate': 0.001, 'activation': 'tanh', 'initializer': 'he_normal'}
Best validation accuracy: 0.7500
NN_performance_check not defined. Skipping performance check.
No description has been provided for this image

RNN with BERT Tokenized Data¶

In [90]:
print(' ======================== 🔹 Training RNN with BERT Tokenized Data - Original 🔹 ========================')
best_model_rnn_bert_original, best_pred_rnn_bert_original, best_history_rnn_bert_original, best_params_rnn_bert_original = bi_rnn_model(
    X_train_bert_hstate, y_train, X_test_bert_hstate, y_test, None, 'BERT Tokenized Data - Original'
)

print(' ======================== 🔹 Training RNN with BERT Tokenized Data - Backtranslated 🔹 ========================')
best_model_rnn_bert_bt, best_pred_rnn_bert_bt, best_history_rnn_bert_bt, best_params_rnn_bert_bt = bi_rnn_model(
    X_train_bt_bert_hstate, y_train_bt_bert_hstate, X_test_bert_hstate, y_test, None, 'BERT Tokenized Data - Backtranslated'
)

print(' ======================== 🔹 Training RNN with BERT Tokenized Data - NLP Augmented 🔹 ========================')
best_model_rnn_bert_nlp, best_pred_rnn_bert_nlp, best_history_rnn_bert_nlp, best_params_rnn_bert_nlp = bi_rnn_model(
    X_train_nlp_bert_hstate, y_train_nlp_bert_hstate, X_test_bert_hstate, y_test, None, 'BERT Tokenized Data - NLP Augmented'
)
 ======================== 🔹 Training RNN with BERT Tokenized Data - Original 🔹 ========================

🎯 Random Search Summary:
Best model -> F1 Macro: 0.3043, Acc: 0.7143, Prec: 0.3222, Rec: 0.2973, lr=0.001, dDO=0.3, act=relu
Best parameters: {'dense_dropout': 0.2, 'learning_rate': 0.003, 'activation': 'tanh', 'initializer': 'glorot_uniform'}
Best validation accuracy: 0.7262
NN_performance_check not defined. Skipping performance check.
No description has been provided for this image
 ======================== 🔹 Training RNN with BERT Tokenized Data - Backtranslated 🔹 ========================

🎯 Random Search Summary:
Best model -> F1 Macro: 0.2282, Acc: 0.7500, Prec: 0.3494, Rec: 0.2333, lr=0.003, dDO=0.3, act=tanh
Best parameters: {'dense_dropout': 0.3, 'learning_rate': 0.003, 'activation': 'tanh', 'initializer': 'glorot_uniform'}
Best validation accuracy: 0.7500
NN_performance_check not defined. Skipping performance check.
No description has been provided for this image
 ======================== 🔹 Training RNN with BERT Tokenized Data - NLP Augmented 🔹 ========================

🎯 Random Search Summary:
Best model -> F1 Macro: 0.2949, Acc: 0.6905, Prec: 0.4750, Rec: 0.2691, lr=0.0005, dDO=0.2, act=tanh
Best parameters: {'dense_dropout': 0.1, 'learning_rate': 0.001, 'activation': 'tanh', 'initializer': 'he_normal'}
Best validation accuracy: 0.7619
NN_performance_check not defined. Skipping performance check.
No description has been provided for this image

Performance evaluation & Conclusion (RNN)¶

In [91]:
result_df_final
Out[91]:
Classifier Train Accuracy Test Accuracy Train F1 Score(avg) Test F1 Score(avg) Train F1 Score(macro) Test F1 Score(macro) Train Recall Test Recall Train Precision (avg) Test Precision (avg) Test Time cv score
0 ANN with GloVe Embedding - Original Data 0.730539 0.750000 0.646910 0.651607 0.232316 0.228177 0.730539 0.750000 0.607719 0.622777 - -
1 ANN with GloVe Embedding - Backtranslated Data 0.126613 0.523810 0.122897 0.542169 0.122862 0.274425 0.126613 0.523810 0.128619 0.579947 - -
2 ANN with GloVe Embedding - NLP Augmented Data 0.466119 0.750000 0.326085 0.655990 0.132519 0.229365 0.466119 0.750000 0.254816 0.629501 - -
3 ANN with BERT Tokenized Data - Original 0.739521 0.773810 0.629868 0.699574 0.170345 0.304518 0.739521 0.773810 0.548534 0.706944 - -
4 ANN with BERT Tokenized Data - Backtranslated 0.242742 0.690476 0.149064 0.611565 0.149130 0.165714 0.242742 0.690476 0.141820 0.548840 - -
5 ANN with BERT Tokenized Data - NLP Augmented 0.509240 0.750000 0.362239 0.657343 0.159603 0.209790 0.509240 0.750000 0.331455 0.596708 - -

Long Short Term Memory (LSTM)¶

In [92]:
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from sklearn.utils.class_weight import compute_class_weight
from sklearn.model_selection import ParameterSampler
from tensorflow.keras.layers import Input, Embedding, Dense, Dropout, Bidirectional, LSTM, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau


def bi_lstm_model(X_train, y_train, X_test, y_test, embedding_matrix=None, embedding_type='glove', run_index=None, n_iter=20):
    y_train = y_train.to_numpy() if hasattr(y_train, 'to_numpy') else y_train
    y_test = y_test.to_numpy() if hasattr(y_test, 'to_numpy') else y_test

    num_classes = len(np.unique(y_train))

    # Define parameter grid
    param_grid = {

        'dense_dropout': [0.2, 0.3, 0.4,0.1],
        'learning_rate': [0.0001, 0.0005, 0.001],
        'activation': ['swish', 'relu','tanh'],
        'batch_size': [32, 64]
    }

    # Sample random parameter combinations
    sampled_params = list(ParameterSampler(param_grid, n_iter=n_iter, random_state=42))

    best_val_accuracy = 0
    best_model = None
    best_y_pred = None
    best_history = None
    best_params = None

    for params in sampled_params:
        # Build model
        if embedding_matrix is not None:
            input_layer = Input(shape=(X_train.shape[1],))
            x = Embedding(
                input_dim=embedding_matrix.shape[0],
                output_dim=embedding_matrix.shape[1],
                weights=[embedding_matrix],
                input_length=X_train.shape[1],
                trainable=True)(input_layer)
            x = tf.keras.layers.SpatialDropout1D(0.1)(x)
        else:
            input_layer = Input(shape=(X_train.shape[1], X_train.shape[2]))
            x = input_layer

        x = Bidirectional(LSTM(64, return_sequences=True))(x)
        x = Dense(32, activation=params['activation'])(x)
        x = BatchNormalization()(x)
        x = Bidirectional(LSTM(16, return_sequences=False))(x)
        x = Dropout(params['dense_dropout'])(x)

        output_layer = Dense(num_classes, activation='softmax')(x)
        model = Model(inputs=input_layer, outputs=output_layer)

        # Compile model
        loss = tf.keras.losses.SparseCategoricalCrossentropy()
        optimizer = tf.keras.optimizers.RMSprop(learning_rate=params['learning_rate'], weight_decay=0.01)
        model.compile(
            optimizer=optimizer,
            loss=loss,
            metrics=[
                tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy'),
                tfa.metrics.HammingLoss(mode='multiclass', name='hamming_loss')
            ]
        )

        # Compute class weights
        class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
        class_weight_dict = dict(zip(np.unique(y_train), class_weights))

        # Callbacks
        callbacks = [
            EarlyStopping(monitor='val_accuracy', patience=7, restore_best_weights=True),
            ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6)
        ]

        # Training
        history = model.fit(
            X_train, y_train,
            validation_data=(X_test, y_test),
            epochs=15,
            batch_size=35,
            class_weight=class_weight_dict,
            callbacks=callbacks,
            verbose=0
        )

        val_accuracy = max(history.history['val_accuracy'])

        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_model = model
            best_y_pred = np.argmax(model.predict(X_test, verbose=0), axis=1)
            best_history = history
            best_params = params

    # Show performance for best model if not in batch run
    if run_index is None:
        print(f"\n🔹 Best Parameters Found: {best_params}")
        print(f"🔹 Best Validation Accuracy: {best_val_accuracy:.4f}")
        try:
            y_pred_train = np.argmax(best_model.predict(X_train, verbose=0), axis=1)
            NN_performance_check(
                best_model, y_train, y_pred_train,
                y_test, best_y_pred,
                best_history,
                model_name='Bi-LSTM',
                embedding_name=embedding_type
            )
        except NameError:
            print("⚠️ NN_performance_check not defined. Skipping performance metrics.")

    return best_model, best_y_pred, best_history, best_params

BiLSTM with GloVe Embedding¶

In [93]:
# Cache the embedding matrix to avoid repeated computation
glove_matrix = glove_embedding_matrix(train_tokenizer)

print(' ======================== 🔹 Training BiLSTM with GloVe Embedding - Original Data 🔹 ========================')
best_model_bilstm_glove_original, best_pred_bilstm_glove_original, best_history_bilstm_glove_original, best_params_bilstm_glove_original = bi_lstm_model(
    X_train_tokenized, y_train, X_test_tokenized, y_test,
    glove_matrix, 'GloVe Embedding - Original Data'
)

print(' ======================== 🔹 Training BiLSTM with GloVe Embedding - Backtranslated Data 🔹 ========================')
best_model_bilstm_glove_bt, best_pred_bilstm_glove_bt, best_history_bilstm_glove_bt, best_params_bilstm_glove_bt = bi_lstm_model(
    X_train_bt_tokenized, y_train_bt_tokenized, X_test_tokenized, y_test,
    glove_matrix, 'GloVe Embedding - Backtranslated Data'
)

print(' ======================== 🔹 Training BiLSTM with GloVe Embedding - NLP Augmented Data 🔹 ========================')
best_model_bilstm_glove_nlp, best_pred_bilstm_glove_nlp, best_history_bilstm_glove_nlp, best_params_bilstm_glove_nlp = bi_lstm_model(
    X_train_nlp_tokenized, y_train_nlp_tokenized, X_test_tokenized, y_test,
    glove_matrix, 'GloVe Embedding - NLP Augmented Data'
)
OOV words: 63/1286 (4.90%)
Shape of glove embedding matrix : (1287, 100)
 ======================== 🔹 Training BiLSTM with GloVe Embedding - Original Data 🔹 ========================

🔹 Best Parameters Found: {'learning_rate': 0.001, 'dense_dropout': 0.1, 'batch_size': 64, 'activation': 'relu'}
🔹 Best Validation Accuracy: 0.6548

🔍 Bi-LSTM Accuracy on Test Data: 0.4167

Classification Report:

              precision    recall  f1-score   support

           0       0.73      0.53      0.62        62
           1       0.03      0.12      0.05         8
           2       0.25      0.17      0.20         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.42        84
   macro avg       0.20      0.16      0.17        84
weighted avg       0.56      0.42      0.47        84

Accuracy : 0.4166666666666667
Precision (macro) : 0.20311827956989248
Precision (avg): 0.5621991807475678
Recall : 0.4166666666666667
F1-Score : 0.47444341743407165
No description has been provided for this image
No description has been provided for this image
 ======================== 🔹 Training BiLSTM with GloVe Embedding - Backtranslated Data 🔹 ========================

🔹 Best Parameters Found: {'learning_rate': 0.001, 'dense_dropout': 0.2, 'batch_size': 64, 'activation': 'tanh'}
🔹 Best Validation Accuracy: 0.7381

🔍 Bi-LSTM Accuracy on Test Data: 0.7381

Classification Report:

              precision    recall  f1-score   support

           0       0.74      1.00      0.85        62
           1       0.00      0.00      0.00         8
           2       0.00      0.00      0.00         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.74        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.54      0.74      0.63        84

Accuracy : 0.7380952380952381
Precision (macro) : 0.14761904761904762
Precision (avg): 0.5447845804988662
Recall : 0.7380952380952381
F1-Score : 0.6268754076973254
No description has been provided for this image
No description has been provided for this image
 ======================== 🔹 Training BiLSTM with GloVe Embedding - NLP Augmented Data 🔹 ========================

🔹 Best Parameters Found: {'learning_rate': 0.001, 'dense_dropout': 0.2, 'batch_size': 64, 'activation': 'tanh'}
🔹 Best Validation Accuracy: 0.7500

🔍 Bi-LSTM Accuracy on Test Data: 0.7500

Classification Report:

              precision    recall  f1-score   support

           0       0.76      1.00      0.86        62
           1       0.00      0.00      0.00         8
           2       1.00      0.17      0.29         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.35      0.23      0.23        84
weighted avg       0.63      0.75      0.66        84

Accuracy : 0.75
Precision (macro) : 0.35121951219512193
Precision (avg): 0.6295005807200929
Recall : 0.75
F1-Score : 0.6559901738473167
No description has been provided for this image
No description has been provided for this image

BiLSTM with BERT Tokenized Data¶

In [94]:
print(' ======================== 🔹 Training BiLSTM with BERT Tokenized Data - Original 🔹 ========================')
best_model_bilstm_bert_original, best_pred_bilstm_bert_original, best_history_bilstm_bert_original, best_params_bilstm_bert_original = bi_lstm_model(
    X_train_bert_hstate, y_train, X_test_bert_hstate, y_test, None, 'BERT Tokenized Data - Original'
)

print(' ======================== 🔹 Training BiLSTM with BERT Tokenized Data - Backtranslated 🔹 ========================')
best_model_bilstm_bert_bt, best_pred_bilstm_bert_bt, best_history_bilstm_bert_bt, best_params_bilstm_bert_bt = bi_lstm_model(
    X_train_bt_bert_hstate, y_train_bt_bert_hstate, X_test_bert_hstate, y_test, None, 'BERT Tokenized Data - Backtranslated'
)

print(' ======================== 🔹 Training BiLSTM with BERT Tokenized Data - NLP Augmented 🔹 ========================')
best_model_bilstm_bert_nlp, best_pred_bilstm_bert_nlp, best_history_bilstm_bert_nlp, best_params_bilstm_bert_nlp = bi_lstm_model(
    X_train_nlp_bert_hstate, y_train_nlp_bert_hstate, X_test_bert_hstate, y_test, None, 'BERT Tokenized Data - NLP Augmented'
)
 ======================== 🔹 Training BiLSTM with BERT Tokenized Data - Original 🔹 ========================

🔹 Best Parameters Found: {'learning_rate': 0.0005, 'dense_dropout': 0.3, 'batch_size': 32, 'activation': 'relu'}
🔹 Best Validation Accuracy: 0.7262

🔍 Bi-LSTM Accuracy on Test Data: 0.7262

Classification Report:

              precision    recall  f1-score   support

           0       0.73      0.98      0.84        62
           1       0.00      0.00      0.00         8
           2       0.00      0.00      0.00         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.73        84
   macro avg       0.15      0.20      0.17        84
weighted avg       0.54      0.73      0.62        84

Accuracy : 0.7261904761904762
Precision (macro) : 0.14698795180722893
Precision (avg): 0.5424555364314401
Recall : 0.7261904761904762
F1-Score : 0.6210180623973728
No description has been provided for this image
No description has been provided for this image
 ======================== 🔹 Training BiLSTM with BERT Tokenized Data - Backtranslated 🔹 ========================

🔹 Best Parameters Found: {'learning_rate': 0.0001, 'dense_dropout': 0.4, 'batch_size': 64, 'activation': 'swish'}
🔹 Best Validation Accuracy: 0.7619

🔍 Bi-LSTM Accuracy on Test Data: 0.7619

Classification Report:

              precision    recall  f1-score   support

           0       0.76      1.00      0.86        62
           1       0.00      0.00      0.00         8
           2       1.00      0.33      0.50         6
           3       0.00      0.00      0.00         6
           4       0.00      0.00      0.00         2

    accuracy                           0.76        84
   macro avg       0.35      0.27      0.27        84
weighted avg       0.63      0.76      0.67        84

Accuracy : 0.7619047619047619
Precision (macro) : 0.35121951219512193
Precision (avg): 0.6295005807200929
Recall : 0.7619047619047619
F1-Score : 0.6712962962962964
No description has been provided for this image
No description has been provided for this image
 ======================== 🔹 Training BiLSTM with BERT Tokenized Data - NLP Augmented 🔹 ========================

🔹 Best Parameters Found: {'learning_rate': 0.0005, 'dense_dropout': 0.1, 'batch_size': 32, 'activation': 'swish'}
🔹 Best Validation Accuracy: 0.7500

🔍 Bi-LSTM Accuracy on Test Data: 0.7500

Classification Report:

              precision    recall  f1-score   support

           0       0.77      1.00      0.87        62
           1       0.00      0.00      0.00         8
           2       0.00      0.00      0.00         6
           3       1.00      0.17      0.29         6
           4       0.00      0.00      0.00         2

    accuracy                           0.75        84
   macro avg       0.35      0.23      0.23        84
weighted avg       0.64      0.75      0.66        84

Accuracy : 0.75
Precision (macro) : 0.3530864197530864
Precision (avg): 0.6363903586125808
Recall : 0.75
F1-Score : 0.6604348032919461
No description has been provided for this image
No description has been provided for this image

Performance evaluation and conclusion (LSTM)¶

In [97]:
result_df_final
Out[97]:
Classifier Train Accuracy Test Accuracy Train F1 Score(avg) Test F1 Score(avg) Train F1 Score(macro) Test F1 Score(macro) Train Recall Test Recall Train Precision (avg) Test Precision (avg) Test Time cv score
0 ANN with GloVe Embedding - Original Data 0.730539 0.750000 0.646910 0.651607 0.232316 0.228177 0.730539 0.750000 0.607719 0.622777 - -
1 ANN with GloVe Embedding - Backtranslated Data 0.126613 0.523810 0.122897 0.542169 0.122862 0.274425 0.126613 0.523810 0.128619 0.579947 - -
2 ANN with GloVe Embedding - NLP Augmented Data 0.466119 0.750000 0.326085 0.655990 0.132519 0.229365 0.466119 0.750000 0.254816 0.629501 - -
3 ANN with BERT Tokenized Data - Original 0.739521 0.773810 0.629868 0.699574 0.170345 0.304518 0.739521 0.773810 0.548534 0.706944 - -
4 ANN with BERT Tokenized Data - Backtranslated 0.242742 0.690476 0.149064 0.611565 0.149130 0.165714 0.242742 0.690476 0.141820 0.548840 - -
5 ANN with BERT Tokenized Data - NLP Augmented 0.509240 0.750000 0.362239 0.657343 0.159603 0.209790 0.509240 0.750000 0.331455 0.596708 - -
6 Bi-LSTM with GloVe Embedding - Original Data 0.805389 0.416667 0.825086 0.474443 0.814702 0.173621 0.805389 0.416667 0.890301 0.562199 - -
7 Bi-LSTM with GloVe Embedding - Backtranslated ... 0.508871 0.738095 0.432445 0.626875 0.432723 0.169863 0.508871 0.738095 0.476405 0.544785 - -
8 Bi-LSTM with GloVe Embedding - NLP Augmented Data 0.622177 0.750000 0.503896 0.655990 0.285246 0.229365 0.622177 0.750000 0.545162 0.629501 - -
9 Bi-LSTM with BERT Tokenized Data - Original 0.829341 0.726190 0.790861 0.621018 0.657166 0.168276 0.829341 0.726190 0.861340 0.542456 - -
10 Bi-LSTM with BERT Tokenized Data - Backtranslated 0.878226 0.761905 0.878410 0.671296 0.878417 0.272222 0.878226 0.761905 0.894624 0.629501 - -
11 Bi-LSTM with BERT Tokenized Data - NLP Augmented 0.989733 0.750000 0.989642 0.660435 0.988212 0.230569 0.989733 0.750000 0.989915 0.636390 - -

Final Result Summary & Conclusion¶

In [96]:
result_df_final
Out[96]:
Classifier Train Accuracy Test Accuracy Train F1 Score(avg) Test F1 Score(avg) Train F1 Score(macro) Test F1 Score(macro) Train Recall Test Recall Train Precision (avg) Test Precision (avg) Test Time cv score
0 ANN with GloVe Embedding - Original Data 0.730539 0.750000 0.646910 0.651607 0.232316 0.228177 0.730539 0.750000 0.607719 0.622777 - -
1 ANN with GloVe Embedding - Backtranslated Data 0.126613 0.523810 0.122897 0.542169 0.122862 0.274425 0.126613 0.523810 0.128619 0.579947 - -
2 ANN with GloVe Embedding - NLP Augmented Data 0.466119 0.750000 0.326085 0.655990 0.132519 0.229365 0.466119 0.750000 0.254816 0.629501 - -
3 ANN with BERT Tokenized Data - Original 0.739521 0.773810 0.629868 0.699574 0.170345 0.304518 0.739521 0.773810 0.548534 0.706944 - -
4 ANN with BERT Tokenized Data - Backtranslated 0.242742 0.690476 0.149064 0.611565 0.149130 0.165714 0.242742 0.690476 0.141820 0.548840 - -
5 ANN with BERT Tokenized Data - NLP Augmented 0.509240 0.750000 0.362239 0.657343 0.159603 0.209790 0.509240 0.750000 0.331455 0.596708 - -
6 Bi-LSTM with GloVe Embedding - Original Data 0.805389 0.416667 0.825086 0.474443 0.814702 0.173621 0.805389 0.416667 0.890301 0.562199 - -
7 Bi-LSTM with GloVe Embedding - Backtranslated ... 0.508871 0.738095 0.432445 0.626875 0.432723 0.169863 0.508871 0.738095 0.476405 0.544785 - -
8 Bi-LSTM with GloVe Embedding - NLP Augmented Data 0.622177 0.750000 0.503896 0.655990 0.285246 0.229365 0.622177 0.750000 0.545162 0.629501 - -
9 Bi-LSTM with BERT Tokenized Data - Original 0.829341 0.726190 0.790861 0.621018 0.657166 0.168276 0.829341 0.726190 0.861340 0.542456 - -
10 Bi-LSTM with BERT Tokenized Data - Backtranslated 0.878226 0.761905 0.878410 0.671296 0.878417 0.272222 0.878226 0.761905 0.894624 0.629501 - -
11 Bi-LSTM with BERT Tokenized Data - NLP Augmented 0.989733 0.750000 0.989642 0.660435 0.988212 0.230569 0.989733 0.750000 0.989915 0.636390 - -

Final Models performance on multiple run¶

Index Model Test F1 Score (avg) Test Accuracy Overfitting?
3 ANN with BERT Tokenized Data - Original 0.6996 0.7738 Small gap
10 Bi-LSTM with BERT Tokenized Data - Backtranslated 0.6713 0.7619 Medium gap (Train F1 = 0.8784)
5 ANN with BERT Tokenized Data - NLP Augmented 0.6573 0.75 Medium gap
2 ANN with GloVe Embedding - NLP Augmented 0.6560 0.75 Large gap (Train F1 = 0.3261)
11 Bi-LSTM with BERT Tokenized Data - NLP Augmented 0.6604 0.75 High overfitting (Train F1 = 0.9896)
In [99]:
with open('/content/drive/MyDrive/Capstone/ANNBertOriginal_model.pkl', 'wb') as file:
     pickle.dump(best_model_ann_bert_original, file)